Sparse Attention in Large Language Models: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. YIN Hongbo
Abstract
Efficient long-context processing in large language models (LLMs) is hindered by the high computational and memory costs of self-attention. Prefilling incurs complexity quadratic in input length, while decoding suffers from growing key–value (KV) caches that strain memory and bandwidth. Sparse attention addresses these issues by selectively computing attention links, preserving model quality while reducing overhead. This survey provides a structured taxonomy of sparse attention methods, dividing them into static patterns (e.g., sliding windows, global tokens) and dynamic strategies. Dynamic methods include attention score selection and KV cache management, enabling adaptive sparsity during inference. We further explore future directions, highlighting the need for task-aware and performance-sensitive sparsity, as well as adaptive control based on runtime conditions, to support robust and scalable LLM deployment.
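To make the static-pattern side of the taxonomy concrete, below is a minimal sketch (not from the survey itself) of a causal sliding-window mask combined with a few global tokens, applied as a mask over dense attention. The function names, window size, and number of global tokens are illustrative assumptions; real sparse-attention kernels skip the masked blocks entirely rather than computing and discarding them.

```python
import torch

def static_sparse_mask(seq_len: int, window: int = 4, n_global: int = 1) -> torch.Tensor:
    """Boolean mask (True = attention allowed): causal sliding window plus global tokens.

    Illustrative sketch only; window and n_global are assumed hyperparameters.
    """
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                # query i may see keys j <= i
    in_window = (idx[:, None] - idx[None, :]) < window   # keys within the local window
    mask = causal & in_window
    mask[:, :n_global] = causal[:, :n_global]            # every query sees the global tokens
    mask[:n_global, :] = causal[:n_global, :]            # global tokens see all prior keys
    return mask

def sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Masked scaled dot-product attention (dense compute, sparse link pattern)."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens, head dimension 16
L, d = 8, 16
q = k = v = torch.randn(L, d)
out = sparse_attention(q, k, v, static_sparse_mask(L))
print(out.shape)  # torch.Size([8, 16])
```

The mask keeps O(window + n_global) links per query instead of O(L), which is the source of the compute and KV-cache savings the abstract refers to; dynamic methods choose the kept links at runtime instead of fixing them in advance.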
PQE Committee
Chair of Committee: Prof. LUO Qiong
Prime Supervisor: Prof. CHEN Lei
Co-Supervisor: Prof. ZHANG Yongqi
Examiner: Prof. LI Lei
Date
09 June 2025
Time
13:00 - 14:00
Venue
E1-149 (HKUST-GZ)
Join Link
Zoom Meeting ID: 969 4767 4160
Tencent Meeting ID:
dsa2025