Heterogeneity-Aware Automatic Parallelization and Scheduling for Large-Model Training: A Survey

PhD Qualifying Examination

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

PhD Qualifying Examination

By Mr. YIN, Yiming

Abstract

Large-model training is increasingly shaped by heterogeneity in model structure, data, and hardware resources. Classical distributed training systems and automatic parallelization methods often assume regular Transformer graphs, stable input shapes, and homogeneous GPU clusters. These assumptions become fragile for modular multimodal models, post-training workflows, variable-shaped data, rollout imbalance, and heterogeneous or geo-distributed clusters.

This survey reviews heterogeneity-aware automatic parallelization and scheduling for large-model training. It organizes recent systems along three axes: execution-graph heterogeneity, data heterogeneity, and hardware or cluster heterogeneity. Across these settings, efficient training requires joint reasoning about partitioning, placement, scheduling, data orchestration, communication, and cost modeling. The survey concludes by discussing open problems in joint planning and dynamic cost modeling.

PQE Committee

Chair: Prof. TANG, Nan

Prime Supervisor: Prof. CHU, Xiaowen

Co-Supervisor: Prof. TANG, Jing

Examiner: Prof. WEN, Zeyi

Date

02 July 2026

Time

09:00:00 - 10:00:00

Venue

E3-201, HKUST(GZ)

Organizer

Data Science and Analytics Thrust

Contact Email

dsarpg@hkust-gz.edu.cn