Heterogeneity-Aware Automatic Parallelization and Scheduling for Large-Model Training: A Survey
The Hong Kong University of Science and Technology (Guangzhou)
Data Science and Analytics Thrust
PhD Qualifying Examination
By Mr. YIN, Yiming
Abstract
Large-model training is increasingly shaped by heterogeneity in model structure, data, and hardware resources. Classical distributed training systems and automatic parallelization methods often assume regular Transformer graphs, stable input shapes, and homogeneous GPU clusters. These assumptions become fragile for modular multimodal models, post-training workflows, variable-shaped data, rollout imbalance, and heterogeneous or geo-distributed clusters.
This survey reviews heterogeneity-aware automatic parallelization and scheduling for large-model training. It organizes recent systems along three axes: execution-graph heterogeneity, data heterogeneity, and hardware or cluster heterogeneity. Across these settings, efficient training requires joint reasoning about partitioning, placement, scheduling, data orchestration, communication, and cost modeling. The survey concludes by discussing open problems in joint planning and dynamic cost modeling.
PQE Committee
Chair: Prof. TANG, Nan
Prime Supervisor: Prof. CHU, Xiaowen
Co-Supervisor: Prof. TANG, Jing
Examiner: Prof. WEN, Zeyi
Date
02 July 2026
Time
09:00:00 - 10:00:00
Venue
E3-201, HKUST(GZ)
Organizer
Data Science and Analytics Thrust
Contact Email
dsarpg@hkust-gz.edu.cn