Data Management for Scalable Deep Learning Pipelines Over Data Streams
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Thesis Examination
By Mr. Shihong GAO
Abstract
Deep learning over dynamic data streams has emerged as a cornerstone for modern intelligent services, powering critical applications such as fraud detection, personalized social network services, and conversational agents. The streaming nature of data, however, introduces fundamental scalability challenges at the system level that go beyond conventional offline learning. In production, models must undergo continual retraining to adapt to fast-evolving data distributions. Enabling such large-scale retraining with fresh data is crucial for maintaining model relevance, but it is often constrained by limited hardware resources and tight training budgets. At the same time, in the deployment phase, the system must serve incoming requests over a continuous and sometimes bursty data flow. Guaranteeing low-latency and high-throughput inference under these dynamic workloads remains difficult, creating a tension between adaptability and efficiency that current solutions struggle to resolve.
This thesis explores specialized data management techniques for deep learning systems, aimed at overcoming these scalability challenges. The study focuses on improving the scalability of pipelines that process two representative forms of data streams: graph-structured data (e.g., social networks) and textual data (e.g., user prompts). In the first work, we present ETC, a general framework designed to alleviate key bottlenecks in temporal graph neural network (T-GNN) training over large-scale dynamic graphs. At its core, ETC employs a single-pass batch-splitting algorithm that strikes a balance between computational efficiency and information preservation, while offering provable optimality guarantees. In addition, to mitigate input data loading overhead, ETC integrates a three-phase deduplication mechanism with a lightweight inter-batch pipeline, thereby streamlining data access. In the second work, to further reduce data loading overhead in T-GNN training, we introduce SIMPLE, a system that exploits dynamic data placement strategies. By maintaining a compact GPU buffer for frequently accessed inputs, SIMPLE minimizes redundant data transfers. The system formulates a memory-constrained interval selection problem to guide buffer placement and employs a greedy algorithm with provable approximation guarantees.
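To give a flavor of the kind of budgeted selection that SIMPLE's buffer-placement problem involves, the following is a minimal, hypothetical sketch of a density-based greedy: candidate cache-residency intervals are ranked by benefit per unit of memory and admitted while the buffer capacity allows. This is an illustrative heuristic only, not the actual SIMPLE algorithm or its approximation analysis; the interval representation and scoring are assumptions.

```python
def select_intervals(intervals, capacity):
    """Greedy sketch of memory-constrained interval selection.

    intervals: list of (benefit, memory_cost) pairs, where `benefit`
    stands in for the data transfers an interval would save if its
    inputs were kept resident in the GPU buffer.
    capacity:  total buffer budget (same units as memory_cost).
    Returns the chosen intervals and the total benefit obtained.
    """
    chosen, used, total_benefit = [], 0, 0
    # Rank by benefit density (benefit per unit of buffer memory).
    for benefit, cost in sorted(intervals, key=lambda x: x[0] / x[1],
                                reverse=True):
        if used + cost <= capacity:
            chosen.append((benefit, cost))
            used += cost
            total_benefit += benefit
    return chosen, total_benefit
```

Density-ordered greedy is a standard baseline for knapsack-style placement problems; the thesis's contribution lies in the problem formulation and the provable approximation guarantee, which this sketch does not reproduce.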
In the third work, we present Apt-Serve, a novel large language model (LLM) inference serving framework aimed at maximizing effective throughput. The framework leverages a hybrid cache architecture to support larger batch sizes and incorporates an adaptive scheduling mechanism that dynamically adjusts batch composition based on runtime information. We formalize the per-iteration scheduling problem in Apt-Serve as an optimization task, establish its NP-hardness, and propose an efficient greedy algorithm with theoretical guarantees.
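Since the per-iteration scheduling problem is NP-hard, a natural baseline is again a budgeted greedy: pending requests are scored and admitted into the next batch while the cache-memory budget holds. The sketch below is purely illustrative, with a made-up per-request score (expected output tokens per unit of cache memory); it is not the Apt-Serve scheduler or its guarantee.

```python
def schedule_batch(requests, mem_budget):
    """Greedy sketch of per-iteration batch composition.

    requests:   list of (request_id, expected_tokens, cache_cost),
                where `expected_tokens` is a stand-in estimate of the
                output a request contributes to effective throughput.
    mem_budget: cache memory available for the next iteration.
    Returns the ids of requests admitted into the batch.
    """
    batch, used = [], 0
    # Admit requests by throughput density until the budget is exhausted.
    for rid, tokens, cost in sorted(requests, key=lambda r: r[1] / r[2],
                                    reverse=True):
        if used + cost <= mem_budget:
            batch.append(rid)
            used += cost
    return batch
```

In practice a serving scheduler would recompute such scores every iteration from runtime signals (queue age, cache occupancy, decode progress), which is the kind of adaptivity the Apt-Serve mechanism targets.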
We validate the efficiency and effectiveness of our proposed methods through extensive experiments on real-world datasets, comparing against state-of-the-art approaches. The thesis concludes with a discussion of potential directions for future research.
Thesis Examination Committee
Chairperson: Prof Danny Hin Kwok TSANG
Prime Supervisor: Prof Can YANG
Co-Supervisor: Prof Lei CHEN
Examiners:
Prof Xiaofang ZHOU
Prof Jia LI
Prof Yongqi ZHANG
Prof Zhiguo GONG
Date
29 September 2025
Time
13:15 - 15:00
Venue
E1-319, HKUST(GZ)
Join Link
Zoom Meeting ID: 99688448171
Passcode: dsa2025
Organizer
Data Science and Analytics Thrust
Contact Email
dsarpg@hkust-gz.edu.cn