PhD Qualifying Examination

Compression-Driven Optimization for Efficient Large Language Model Inference: A Survey

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust


By Mr. HUANG, Xinhao

Abstract

The inference efficiency of Large Language Models (LLMs) is significantly constrained by their inherent parameter redundancy, high computational complexity, and substantial memory consumption. Optimizing LLM inference is thus a critical research frontier in generative AI. While model compression is a pivotal approach to enhancing efficiency, achieving an optimal balance between compression efficacy and model performance in resource-constrained environments remains an underexplored challenge. This survey provides a comprehensive, categorized review of state-of-the-art LLM compression techniques, focusing on four key dimensions: model weight compression, optimization of the Key-Value (KV) cache during inference, dynamic compression ratio allocation strategies, and system-level performance tuning for LLM inference systems. Building on this analysis of existing work, the survey proposes a novel research framework encompassing three core components: “Fine-grained Low-Rank Decomposition,” “Efficient Attention Key-Value Cache,” and “Adaptive Budget Allocation.” The proposed framework aims to provide novel methods for efficient LLM inference and deployment in resource-constrained environments, thereby offering crucial support for the practical application of generative AI technologies across various critical domains.

PQE Committee

Chair of Committee: Prof. LUO Qiong

Prime Supervisor: Prof. WEN Zeyi

Co-Supervisor: Prof. GAN Zecheng

Examiner: Prof. TANG Guoming

Date

09 June 2025

Time

16:00 - 17:00

Venue

E1-147 (HKUST-GZ)

Join Link

Zoom Meeting ID:
948 2448 3923


Passcode: dsa2025