Similarity Representation and Efficient Indexing in Vector Databases for Language Models

PhD Qualifying Examination

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

By Mr. QIU, Runwen

Abstract

Vector databases play a central role in Retrieval-Augmented Generation (RAG) systems by enabling efficient similarity search over large document collections. Because both queries and documents are typically text, retrieval must accurately model the semantic similarity between them, which is inherently challenging. In practical RAG systems, corpora often contain millions or even billions of documents, so exhaustive similarity computation is prohibitively expensive and leads to unacceptable query latency. Consequently, two components have become central to modern vector databases: similarity representation and efficient indexing. Although both topics have been studied extensively in the AI and database communities, existing surveys usually discuss them separately and rarely analyze them from the perspective of RAG. This survey therefore provides a systematic overview of similarity representation and efficient indexing techniques for vector databases in RAG systems. We categorize existing methods into sparse and dense paradigms according to whether documents are represented as sparse or dense vectors. In addition, we discuss several open challenges in building efficient and downstream-aware vector databases.

PQE Committee

Chair: Prof. TANG, Nan
Prime Supervisor: Prof. TANG, Jing
Co-Supervisor: Prof. CHEN, Wei
Examiner: Prof. DING, Ningning

Date

10 June 2026

Time

14:00 - 15:00

Venue

E1-148, HKUST(GZ)