Sparse Attention in Large Language Models: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. YIN Hongbo
Abstract
Efficient long-context processing in large language models (LLMs) is hindered by the high computational and memory costs of self-attention. Prefilling incurs complexity quadratic in input length, while decoding suffers from growing key–value (KV) caches that strain memory and bandwidth. Sparse attention addresses these issues by selectively computing attention links, preserving model quality while reducing overhead. This survey provides a structured taxonomy of sparse attention methods, dividing them into static patterns (e.g., sliding windows, global tokens) and dynamic strategies. Dynamic methods include attention score selection and KV cache management, enabling adaptive sparsity during inference. We further explore future directions, highlighting the need for task-aware and performance-sensitive sparsity, as well as adaptive control based on runtime conditions, to support robust and scalable LLM deployment.
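To make the static-pattern side of the taxonomy concrete, below is a minimal sketch (not from the survey itself) of a causal sliding-window mask combined with a few global tokens, applied as a mask over dense attention. The function names, window size, and number of global tokens are illustrative assumptions; real sparse-attention kernels skip the masked blocks entirely rather than computing and discarding them.

```python
import torch

def static_sparse_mask(seq_len: int, window: int = 4, n_global: int = 1) -> torch.Tensor:
    """Boolean mask (True = attention allowed): causal sliding window plus global tokens.

    Illustrative sketch only; window and n_global are assumed hyperparameters.
    """
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                # query i may see keys j <= i
    in_window = (idx[:, None] - idx[None, :]) < window   # keys within the local window
    mask = causal & in_window
    mask[:, :n_global] = causal[:, :n_global]            # every query sees the global tokens
    mask[:n_global, :] = causal[:n_global, :]            # global tokens see all prior keys
    return mask

def sparse_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    """Masked scaled dot-product attention (dense compute, sparse link pattern)."""
    scores = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 8 tokens, head dimension 16
L, d = 8, 16
q = k = v = torch.randn(L, d)
out = sparse_attention(q, k, v, static_sparse_mask(L))
print(out.shape)  # torch.Size([8, 16])
```

The mask keeps O(window + n_global) links per query instead of O(L), which is the source of the compute and KV-cache savings the abstract refers to; dynamic methods choose the kept links at runtime instead of fixing them in advance.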
PQE Committee
Chair of Committee: Prof. LUO Qiong
Prime Supervisor: Prof. CHEN Lei
Co-Supervisor: Prof. ZHANG Yongqi
Examiner: Prof. LI Lei
Date
09 June 2025
Time
13:00 - 14:00
Venue
E1-149 (HKUST-GZ)
Join Link
Zoom Meeting ID: 969 4767 4160
Tencent Meeting ID:
dsa2025