2605.30288v1 May 28, 2026 cs.AI

MIRA: Mid-training Rubric Anchoring for Source-Aware Data Selection

Shukai Liu
Xianglong Liu
Jiajun Wu
Siheng Chen
Yaxin Du
T. Zheng
Jian Yang
Haowen Wang
Yuxuan Zhang
Pingjie Wang
Mingxun Zhou
Carnegie Mellon University

Mid-training has become an important stage in modern LLM development, using large-scale curated mixtures to strengthen capabilities before final post-training. Its data selection problem is distinct: the data are optimized under a pretraining-style objective at near-pretraining scale, but are curated toward downstream capabilities and drawn from heterogeneous sources with different formats and training roles. As a result, effective selection requires both scalability and source-adaptive semantic criteria. Existing model-based methods scale well, but provide only implicit quality signals. Semantic selection methods offer stronger judgments, but usually assume fixed rubrics or standardized data formats. To address this mismatch, we propose MIRA, a source-aware filtering framework based on self-anchored rubric discovery. The key idea is to make rubric construction part of data selection: MIRA first discovers what should be evaluated for each source group, then distills those judgments into scalable student scorers for full-corpus filtering. On code-oriented mid-training with 21 sources and 5 source groups, MIRA outperforms selection baselines across nine code benchmarks and matches the full-corpus run while using only half the tokens.
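The filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Doc` type, the `select_by_group` function, the toy scorers, the group labels, and the whitespace token count are all hypothetical stand-ins chosen for illustration. The sketch assumes each source group already has a scorer (standing in for a distilled student scorer) and keeps the highest-scoring documents in each group until roughly half of that group's tokens are used.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Doc(NamedTuple):
    text: str
    source_group: str  # hypothetical label such as "code" or "qa"


# A scorer maps a document's text to a quality score (higher is better).
Scorer = Callable[[str], float]


def select_by_group(
    docs: List[Doc],
    scorers: Dict[str, Scorer],
    keep_fraction: float = 0.5,
) -> List[Doc]:
    """Keep the top-scoring docs of each source group within a token budget."""
    by_group: Dict[str, List[Doc]] = defaultdict(list)
    for doc in docs:
        by_group[doc.source_group].append(doc)

    selected: List[Doc] = []
    for group, members in by_group.items():
        scorer = scorers[group]  # each group is judged by its own criteria
        # Assumption: whitespace token counts stand in for a real tokenizer.
        budget = keep_fraction * sum(len(d.text.split()) for d in members)
        used = 0
        for doc in sorted(members, key=lambda d: scorer(d.text), reverse=True):
            n_tokens = len(doc.text.split())
            if used + n_tokens > budget:
                break
            selected.append(doc)
            used += n_tokens
    return selected


# Toy scorers (hypothetical): real scorers would be learned models.
scorers: Dict[str, Scorer] = {
    "code": lambda t: float("def " in t),
    "qa": lambda t: float("?" in t),
}

docs = [
    Doc("def add(a, b): return a + b", "code"),
    Doc("x x x x x x x", "code"),
    Doc("How do I sort a list? Use sorted(xs).", "qa"),
    Doc("lorem ipsum dolor sit amet foo bar baz", "qa"),
]

kept = select_by_group(docs, scorers, keep_fraction=0.5)
```

Scoring within each group, rather than with one global threshold, is the design choice that mirrors the source-aware idea: documents from different sources are never ranked against each other under a single criterion.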
