Elastic model serving via efficient autoscaling

Abstract
AI workloads like model serving are inherently dynamic. Autoscaling is a promising technique for elastic model serving to handle such dynamics, but cold-start overheads limit its effectiveness. In this talk, I will discuss our recent work on accelerating cold starts. Key insights include leveraging OS primitives (GPU checkpoint and restore) to eliminate container startup costs and using advanced networking and model execution features to overcome GPU parameter loading limits. Our first system, PhoenixOS, can boot a Llama-2 13B inference instance on one GPU in 300 ms, 10–96× faster than state-of-the-art solutions such as cuda-checkpoint and Docker. Our second system, BlitzScale, further enhances serving throughput without full parameter loading or altering the inference process (e.g., no early exit), through a model-system-network co-designed approach.
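The checkpoint-and-restore idea behind eliminating cold starts can be illustrated with NVIDIA's cuda-checkpoint utility, one of the baselines mentioned above. Below is a minimal Python sketch, assuming a running CUDA inference process whose PID is known; the "cuda-checkpoint --toggle --pid" invocation is the utility's actual interface, while the PID value and the timing harness are illustrative assumptions, not part of the systems described in the talk.

    # Sketch: toggling a CUDA process between running and checkpointed
    # state with NVIDIA's cuda-checkpoint utility, and timing the switch.
    import subprocess
    import time

    def toggle_cuda_state(pid: int) -> float:
        """Toggle the CUDA state of the given process (suspend if running,
        resume if checkpointed) and return the elapsed time in seconds."""
        start = time.monotonic()
        subprocess.run(
            ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
            check=True,
        )
        return time.monotonic() - start

    if __name__ == "__main__":
        pid = 12345  # placeholder PID of a running inference process
        print(f"checkpoint took {toggle_cuda_state(pid):.3f}s")  # suspend
        print(f"restore took {toggle_cuda_state(pid):.3f}s")     # resume

Measuring the restore path this way is how one would reproduce the kind of cold-start comparison the abstract cites against cuda-checkpoint and Docker.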
Speaker Bio
Xingda Wei is a tenure-track Assistant Professor at Shanghai Jiao Tong University. His main research interests include improving the performance, reliability, and resource efficiency of system support for AI. He has published papers at conferences including OSDI, SOSP, EuroSys, and NSDI. He has received awards including the EuroSys 2024 Best Paper Award, the 2022 Huawei OlympusMons Award, and the 2021 ACM SIGOPS Dennis M. Ritchie Award. He serves on the program committees of multiple leading systems conferences, including OSDI, ASPLOS, and NSDI, and as program committee chair of ACM ChinaSys.
Date
06 March 2025
Time
15:00 - 16:00
Venue
E1-201 (HKUST-GZ)
Join Link
Zoom Meeting ID: 910 5670 9161
Passcode: dsat
Organizer
Data Science and Analytics Thrust
Contact Email
dsat@hkust-gz.edu.cn