Heterogeneity-Aware Automatic Parallelization and Scheduling for Large-Model Training: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. YIN, Yiming
Abstract
Large-model training is increasingly shaped by heterogeneity in model structure, data, and hardware resources. Classical distributed training systems and automatic parallelization methods often assume regular Transformer graphs, stable input shapes, and homogeneous GPU clusters. These assumptions often fail to hold for modular multimodal models, post-training workflows, variable-shaped data, imbalanced rollouts, and heterogeneous or geo-distributed clusters.
This survey reviews heterogeneity-aware automatic parallelization and scheduling for large-model training. It organizes recent systems along three axes: execution-graph heterogeneity, data heterogeneity, and hardware or cluster heterogeneity. Across these settings, efficient training requires joint reasoning about partitioning, placement, scheduling, data orchestration, communication, and cost modeling. The survey concludes by discussing open problems in joint planning and dynamic cost modeling.
PQE Committee
Chair: Prof. TANG, Nan
Prime Supervisor: Prof. CHU, Xiaowen
Co-Supervisor: Prof. TANG, Jing
Examiner: Prof. WEN, Zeyi
Date
02 July 2026
Time
09:00:00 - 10:00:00
Location
E3-201, HKUST(GZ)
Event Organizer
Data Science and Analytics Thrust
dsarpg@hkust-gz.edu.cn