Elastic model serving via efficient autoscaling

Abstract
AI workloads like model serving are inherently dynamic. Autoscaling is a promising technique for elastic model serving to handle such dynamics, but cold-start overheads limit its effectiveness. In this talk, I will discuss our recent work on accelerating cold starts. Key insights include leveraging OS primitives (GPU checkpoint and restore) to eliminate container startup costs and using advanced networking and model execution features to overcome GPU parameter loading limits. Our first system, PhoenixOS, can boot a Llama-2 13B inference instance on one GPU in 300 ms, 10–96× faster than state-of-the-art solutions such as cuda-checkpoint and Docker. Our second system, BlitzScale, further enhances serving throughput without full parameter loading or altering the inference process (e.g., no early exit), through a model-system-network co-designed approach.
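The checkpoint-and-restore idea behind eliminating cold starts can be illustrated with NVIDIA's cuda-checkpoint utility, one of the baselines mentioned above. Below is a minimal Python sketch, assuming a running CUDA inference process whose PID is known; the "cuda-checkpoint --toggle --pid" invocation is the utility's actual interface, while the PID value and the timing harness are illustrative assumptions, not part of the systems described in the talk.

    # Sketch: toggling a CUDA process between running and checkpointed
    # state with NVIDIA's cuda-checkpoint utility, and timing the switch.
    import subprocess
    import time

    def toggle_cuda_state(pid: int) -> float:
        """Toggle the CUDA state of the given process (suspend if running,
        resume if checkpointed) and return the elapsed time in seconds."""
        start = time.monotonic()
        subprocess.run(
            ["cuda-checkpoint", "--toggle", "--pid", str(pid)],
            check=True,
        )
        return time.monotonic() - start

    if __name__ == "__main__":
        pid = 12345  # placeholder PID of a running inference process
        print(f"checkpoint took {toggle_cuda_state(pid):.3f}s")  # suspend
        print(f"restore took {toggle_cuda_state(pid):.3f}s")     # resume

Measuring the restore path this way is how one would reproduce the kind of cold-start comparison the abstract cites against cuda-checkpoint and Docker.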
Speaker Bio
Xingda Wei is a tenure-track Assistant Professor at Shanghai Jiao Tong University. His main research interests include improving the performance, reliability, and resource efficiency of system support for AI. He has published papers at conferences including OSDI, SOSP, EuroSys, and NSDI. He has received awards including the EuroSys 2024 Best Paper Award, the 2022 Huawei OlympusMons Award, and the 2021 ACM SIGOPS Dennis M. Ritchie Award. He serves on the program committees of multiple leading systems conferences, including OSDI, ASPLOS, and NSDI, and as program committee chair of ACM ChinaSys.
Date
06 March 2025
Time
15:00 - 16:00
Venue
E1-201 (HKUST-GZ)
Join Link
Zoom Meeting ID: 910 5670 9161
Passcode: dsat
Organizer
Data Science and Analytics Thrust
Contact Email
dsat@hkust-gz.edu.cn