KV Cache Optimization for Efficient Long-Context LLM Inference: From Algorithms to Systems

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

PhD Qualifying Examination

By Mr. LIAO, Chengxi

Abstract

Mixture-of-Experts (MoE) architectures have become a key design paradigm in modern large language models (LLMs), enabling large-scale parameter growth while maintaining efficient inference through sparse activation. By activating only a subset of experts for each token, MoE models significantly improve capacity and scaling efficiency and have been widely adopted in state-of-the-art systems such as Switch Transformer, Mixtral, Qwen-MoE, and the DeepSeek series. However, MoE inference introduces unique challenges, including dynamic routing, irregular computation, memory overhead, load imbalance, and costly inter-device communication, which are not well addressed by techniques designed for dense LLMs. This survey provides a comprehensive review of recent advances in MoE-LLM inference optimization. It systematically categorizes existing methods into three levels: algorithm-level optimization, system-level optimization, and hardware-level optimization. Algorithm-level approaches focus on improving routing strategies, reducing redundant computation, and compressing expert parameters. System-level techniques aim to enhance runtime efficiency through expert scheduling, communication reduction, memory management, and heterogeneous execution. Hardware-level methods explore specialized accelerators and hardware-software co-design to better support sparse and dynamic computation patterns. Finally, it discusses open challenges and future research directions toward efficient, scalable, and deployment-ready MoE inference across cloud and edge environments.

PQE Committee

Chairperson: Prof. YU, Xu Jeffrey

Prime Supervisor: Prof. WEN, Zeyi

Co-Supervisor: Prof. TSUNG, Fu-Gee

Examiner: Prof. DING, Ningning

Date

09 June 2026

Time

11:00:00 - 12:00:00

Location

E1-147, HKUST(GZ)

Event Organizer

Data Science and Analytics Thrust

Email

dsarpg@hkust-gz.edu.cn