Long-Context LLMs Through the Data Lens: Training, Reasoning, and Evaluation

The Hong Kong University of Science and Technology (Guangzhou)

Data Science and Analytics Thrust

PhD Qualifying Examination

By Mr. PENG, Miao

Abstract

Large language models have rapidly expanded their context windows from 4K to over 1M tokens, yet the ability to effectively utilize long contexts—particularly for complex reasoning and agentic tasks—has not scaled proportionally. In this survey, we provide a systematic review of training data strategies for long-context LLMs through a data-centric lens, with reasoning and agentic capabilities as the target. We organize the long-context training pipeline into three stages—pretraining (coverage-driven), midtraining (capability-driven), and post-training (reasoning-driven)—and examine how data design philosophy evolves across them. For pretraining, we review data collection strategies, length-aware quality filtering, domain mixing, and curriculum design for extending context length at scale. For midtraining, we identify two complementary data construction routes (natural reformulation and synthetic proxy tasks) targeting three training signals: retrieval precision, information densification, and reasoning depth, and argue that effective agent behavior emerges from deliberate data design as a co-development objective. For post-training, we characterize canonical long-context task forms, review task-oriented data synthesis strategies that serve both SFT and RL, and discuss training recipes including reward design and RL algorithms tailored to long-context reasoning. We further examine evaluation benchmarks with emphasis on the gap between retrieval-oriented and reasoning-oriented assessment, and highlight the under-explored dimension of knowledge reliability. Finally, we outline open challenges including data efficiency, scaling laws, and evaluation standardization. By consolidating recent advances under a unified data-centric framework targeting reasoning and agentic capabilities, we aim to provide a useful reference for researchers and engineers.

PQE Committee

Chair: Prof. YU, Xu Jeffrey
Prime Supervisor: Prof. LI, Jia
Co-Supervisor: Prof. TSUNG, Fu-Gee (online)
Examiner: Prof. DING, Zishuo

Date

10 June 2026

Time

11:00:00 - 12:00:00

Location

E1-147, HKUST(GZ)

Join Link

Zoom Meeting ID:
933 4511 6271

Tencent Meeting ID:
dsa2026