2605.30334v1 May 28, 2026 cs.AI

Demystifying Data Organization for Enhanced LLM Training

Hao Li
Wenshan Wu
Tong Yang
Yalun Dai
Yangyu Huang
Xin Zhang
Qihao Zhao
Yuanyuan Gao
Kim-Hui Yap
Scarlett Li
Yonghan Wang

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency relies heavily on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains underexplored, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by these guidelines, we introduce two novel data ordering methods, termed STR and SAW. Extensive experiments across different model scales and data sizes, covering both the pre-training and supervised fine-tuning (SFT) stages, validate the effectiveness of the guidelines. They also demonstrate the robustness of the proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/
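
The abstract does not define STR or SAW, so the following is not a reconstruction of either method. It is only a minimal sketch of the general idea of reusing pre-computed per-sample scores to decide training order, under assumptions: the function names `sorted_order` and `folded_order` and the `folds` parameter are hypothetical, and scores are taken to be plain floats where lower means "earlier in a curriculum".

```python
from typing import List, Sequence


def sorted_order(scores: Sequence[float]) -> List[int]:
    """Return sample indices in ascending score order (one curriculum pass).

    Hypothetical helper: ties keep their original relative order because
    Python's sort is stable.
    """
    return sorted(range(len(scores)), key=lambda i: scores[i])


def folded_order(scores: Sequence[float], folds: int) -> List[int]:
    """Return sample indices arranged as `folds` consecutive ascending passes.

    Hypothetical helper: the score-ranked indices are dealt round-robin into
    `folds` groups, and the groups are concatenated, so each group sweeps
    from low to high scores before the next group starts again from low.
    """
    if folds < 1:
        raise ValueError("folds must be at least 1")
    ranked = sorted_order(scores)
    order: List[int] = []
    for k in range(folds):
        # Every `folds`-th ranked index, starting at offset k, stays ascending.
        order.extend(ranked[k::folds])
    return order


# Illustrative scores only; these are not values from the paper.
example_scores = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2]
single_pass = sorted_order(example_scores)
two_passes = folded_order(example_scores, folds=2)
```

With `folds=1` the two functions agree; larger values trade one long low-to-high sweep for several shorter ones. No additional scoring pass is needed, since only the existing scores are sorted.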


