A Survey on Efficient Large Language Model Serving Systems
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Ms. Xuemei Peng
Abstract
Large Language Models (LLMs) have become ubiquitous in applications ranging from natural language processing to complex decision-making tasks. With their increasing prevalence, efficient LLM serving systems are critical to ensuring both optimal performance and effective resource utilization. This survey explores the latest techniques for optimizing LLM serving, offering a comprehensive overview that covers advancements in memory management, computation optimization, and the development of advanced LLM paradigms. We also present our attempts at improving LLM serving efficiency, featuring innovations such as dynamic request packaging, adaptive GPU resource allocation, and the strategic duplication and merging of pipelines. Experimental results validate the effectiveness of our approach. Finally, we propose directions for future research in efficient LLM serving, with the goal of further enhancing performance and resource management.
PQE Committee
Chair of Committee: Prof. Xiaowen CHU
Prime Supervisor: Prof. Zeyi WEN
Co-Supervisor: Prof. Xinyu CHEN
Examiner: Prof. Zeke XIE
Date
27 November 2024
Time
15:00:00 - 16:00:00
Location
E3-105
Join Link
Zoom Meeting ID: 945 2523 7448
Passcode: dsa2024