Similarity Representation and Efficient Indexing in Vector Databases for Language Models
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. QIU, Runwen
Abstract
Vector databases play a central role in Retrieval-Augmented Generation (RAG) systems by enabling efficient similarity search over large document collections. Since both queries and documents are typically textual, retrieval must accurately model the semantic similarity between them, which is inherently challenging. In practical RAG systems, corpora often contain millions or even billions of documents, so exhaustive similarity computation is prohibitively expensive and leads to unacceptable query latency. Consequently, two components have become central to modern vector databases: similarity representation and efficient indexing. Although both topics have been studied extensively in the AI and database communities, existing surveys usually discuss them separately and rarely analyze them from the perspective of RAG. This survey therefore provides a systematic overview of similarity representation and efficient indexing techniques for vector databases in RAG systems. We categorize existing methods into sparse and dense paradigms, according to whether documents are represented as sparse or dense vectors. In addition, we discuss several open challenges in building efficient and downstream-aware vector databases.
PQE Committee
- Chair: Prof. TANG, Nan
- Prime Supervisor: Prof. TANG, Jing
- Co-Supervisor: Prof. CHEN, Wei
- Examiner: Prof. DING, Ningning
Date
10 June 2026
Time
14:00 - 15:00
Venue
E1-148, HKUST(GZ)