PhD Qualifying Examination

Efficient KV Cache Management in Large Language Model Serving: A Lifecycle-Oriented Survey

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

By Mr LI Zeyu

摘要

As LLMs become integral to cloud services and intelligent applications, efficient serving has become a key system-level priority. Among the various system bottlenecks, KV cache management has emerged as a critical challenge: the KV cache accelerates inference by reusing previously computed key-value pairs, but its size grows rapidly with increasing sequence length, model depth, concurrency, and conversation turns, posing significant challenges to memory capacity, scheduling, and system scalability.

In this survey, we introduce a lifecycle-oriented taxonomy for KV cache management, categorizing existing techniques according to three key stages: Creation, Reuse, and Eviction. In the Creation stage, we review methods for KV cache compression, memory allocation, offloading and distributed placement, and usage-efficient attention kernels. In the Reuse stage, we summarize recent advances in cache-aware scheduling, semantic and prefix-based retrieval, and storage and fetching across heterogeneous and distributed memory. In the Eviction stage, we discuss both traditional cache policies and specialized strategies tailored to KV cache management in LLM serving. Through this taxonomy, we find that KV cache management remains an underexplored area, with many open challenges and opportunities for optimization. We hope this survey helps the community better organize existing work, identify key bottlenecks, and inspire future research on efficient KV cache management in LLM serving.

PQE Committee

Chair of Committee: Prof. WANG, Wei
Prime Supervisor: Prof. CHU, Xiaowen
Co-Supervisor: Prof. CHEN, Xinyu
Examiner: Prof. WEN, Zeyi

Date

17 July 2025

Time

11:00 - 12:00

Venue

E3-201 (HKUST-GZ)

Join Link

Zoom Meeting ID: 974 5070 8751
Passcode: dsa2025