Sparse Attention in Large Language Models: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. YIN Hongbo
Abstract
Efficient long-context processing in large language models (LLMs) is hindered by the high computational and memory costs of self-attention. Prefilling incurs quadratic complexity with input length, while decoding suffers from growing key–value (KV) caches that strain memory and bandwidth. Sparse attention addresses these issues by selectively computing attention links, preserving model quality while reducing overhead. This survey provides a structured taxonomy of sparse attention methods, dividing them into static patterns (e.g., sliding windows, global tokens) and dynamic strategies. Dynamic methods include attention score selection and KV cache management, enabling adaptive sparsity during inference. We further explore future directions, highlighting the need for task-aware and performance-sensitive sparsity, as well as adaptive control based on runtime conditions, to support robust and scalable LLM deployment.
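To make the static patterns mentioned above concrete, the following is a minimal NumPy sketch (not taken from any surveyed method; all names and parameters are illustrative) of a sliding-window-plus-global-token mask applied in a dense reference attention. For clarity it only masks out disallowed scores, whereas real sparse-attention kernels skip those links entirely to save computation and KV-cache traffic.

import numpy as np

def sparse_attention_mask(seq_len, window_size=4, global_tokens=(0,)):
    # Static sparse pattern: each query attends to a local sliding window
    # plus a small set of global tokens. True = keep link, False = skip.
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for q in range(seq_len):
        lo = max(0, q - window_size)
        hi = min(seq_len, q + window_size + 1)
        mask[q, lo:hi] = True  # local window (a causal variant would use k <= q)
    for g in global_tokens:
        mask[g, :] = True      # global token attends to every position
        mask[:, g] = True      # and every position attends to it
    return mask

def masked_attention(Q, K, V, mask):
    # Dense reference: disallowed scores are set to -inf before the softmax.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 12, 8
    Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
    mask = sparse_attention_mask(n, window_size=2, global_tokens=(0,))
    out = masked_attention(Q, K, V, mask)
    print("kept attention links:", int(mask.sum()), "of", n * n)
    print("output shape:", out.shape)

With a fixed window size, the number of kept links grows linearly rather than quadratically with sequence length, which is the source of the savings the abstract describes; dynamic methods instead decide at inference time which links or cached KV entries to keep.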
PQE Committee
Chair of Committee: Prof. LUO Qiong
Prime Supervisor: Prof. CHEN Lei
Co-Supervisor: Prof. ZHANG Yongqi
Examiner: Prof. LI Lei
Date
09 June 2025
Time
13:00 - 14:00
Location
E1-149 (HKUST-GZ)
Join Link
Zoom Meeting ID: 969 4767 4160
Tencent Meeting ID: dsa2025