Compression-Driven Optimization for Efficient Large Language Model Inference: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. HUANG, Xinhao
Abstract
The inference efficiency of Large Language Models (LLMs) is significantly constrained by their inherent parameter redundancy, high computational complexity, and substantial memory consumption. Optimizing LLM inference is thus a critical research frontier in generative AI. While model compression is a pivotal approach to enhancing efficiency, achieving an optimal balance between compression efficacy and model performance in resource-constrained environments remains an underexplored challenge. This survey provides a comprehensive, categorized review of state-of-the-art LLM compression techniques along four key dimensions: model weight compression, optimization of the Key-Value (KV) cache during inference, dynamic compression-ratio allocation strategies, and system-level performance tuning for LLM inference systems. Building on this analysis of existing work, the survey proposes a novel research framework encompassing the core components of “Fine-grained Low-Rank Decomposition,” “Efficient Attention Key-Value Cache,” and “Adaptive Budget Allocation.” The proposed framework aims to provide new methods for efficient LLM inference and deployment in resource-constrained environments, thereby offering crucial support for the practical application of generative AI technologies across critical domains.
PQE Committee
Chair of Committee: Prof. LUO Qiong
Prime Supervisor: Prof. WEN Zeyi
Co-Supervisor: Prof. GAN Zecheng
Examiner: Prof. TANG Guoming
Date
09 June 2025
Time
16:00 - 17:00
Location
E1-147 (HKUST-GZ)
Join Link
Zoom Meeting ID: 948 2448 3923
Passcode: dsa2025