Designing SLO-Aware Serving Systems for Dependency-Intensive Applications
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Ms PENG, Xuemei
Abstract
Modern AI applications increasingly adopt dependency-intensive execution patterns, where requests trigger multi-stage pipelines composed of interconnected models, operators, and external tools. Unlike traditional single-model inference, such applications form complex task graphs whose critical paths, variable arrival patterns, and heterogeneous resource requirements make it fundamentally difficult to provide predictable performance under strict Service-Level Objectives (SLOs). Existing serving systems primarily focus on optimizing standalone model inference or homogeneous operator batches, but they overlook the end-to-end dependencies that shape queueing delays, batching opportunities, and GPU resource contention. As a result, they often fail to meet latency targets when requests interact through shared components or when upstream decisions affect downstream execution. This thesis proposal seeks to develop a unified framework for SLO-aware serving of dependency-intensive applications, combining graph-based workload modeling with adaptive scheduling mechanisms. The key idea is to jointly reason about dependency propagation, batching flexibility, execution deadlines, and GPU resource allocation, enabling the serving system to make globally informed decisions rather than isolated per-operator optimizations. This proposal explores three complementary directions: (1) Dependency-aware batching, which dynamically selects batching boundaries and batch sizes based on SLO slack and graph structure; (2) SLO-guided scheduling and resource allocation, which assigns GPU streams, SM shares, and execution priorities according to critical-path urgency; and (3) Graph-level admission and shaping policies, which reshape request arrival patterns to reduce tail latency while preserving throughput. By combining formal models, algorithm design, and system implementation, this work aims to deliver a serving framework that maintains high throughput while providing predictable SLO satisfaction for complex, multi-stage AI applications. The resulting system will help bridge the gap between the increasing structural complexity of AI applications and the need for reliable, efficient, and cost-effective deployment at scale.
TPE Committee
Chair of Committee: Prof. TANG, Nan
Prime Supervisor: Prof. WEN, Zeyi
Co-Supervisor: Prof. CHEN, Xinyu
Examiner: Prof. XIE, Zeke
Date: 10 December 2025
Time: 11:00 - 12:00
Location: E3-201 (HKUST-GZ)