2605.27354v1 May 26, 2026 cs.LG

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Jinwu Hu
Lei Hou
Juanzi Li
Xiaozhi Wang
Yi Jing
Zao Dai
Zijun Yao
Tsinghua University

Model internals encode rich information about how a large language model (LLM) processes its training data, yet post-training data engineering relies largely on external signals and ignores the intrinsic signals in those internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties (diversity, difficulty, and quality) using model internals extracted with a sparse autoencoder (SAE), a mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches the target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAEs transfer effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.
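
As a rough illustration of how the three operations named in the abstract could fit together, the sketch below filters by a quality score, clusters examples in a feature space, and interleaves clusters while ordering each cluster from easy to hard. It is not the authors' implementation: the abstract does not specify the clustering algorithm, the difficulty proxy, the quality probe, or the mixing rule. The names `kmeans` and `plan_batches`, the `min_quality` threshold, the round-robin mixing, and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's k-means (an assumed stand-in for SAE-space clustering)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points as initial centers.
    centers = x[rng.choice(len(x), size=k, replace=False)].astype(float)
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every center.
        dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels


def plan_batches(labels, difficulty, quality, batch_size, min_quality=0.5):
    """Return a list of index arrays, one per training batch.

    Hypothetical pipeline: (1) drop low-quality examples, (2) sort each
    cluster from easy to hard, (3) take examples round-robin across
    clusters so that consecutive batches mix clusters.
    """
    # (1) Quality filtering: keep examples whose score passes the threshold.
    keep = np.flatnonzero(quality >= min_quality)

    # (2) Easy-to-hard ordering inside each cluster.
    per_cluster = []
    for c in np.unique(labels[keep]):
        members = keep[labels[keep] == c]
        per_cluster.append(members[np.argsort(difficulty[members], kind="stable")])

    # (3) Round-robin across clusters: the easiest of each cluster first.
    order = []
    longest = max((len(m) for m in per_cluster), default=0)
    for rank in range(longest):
        for members in per_cluster:
            if rank < len(members):
                order.append(members[rank])
    order = np.array(order, dtype=int)

    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


# Random stand-ins for SAE features, difficulty proxy, and quality scores.
rng = np.random.default_rng(0)
sae_features = rng.random((12, 8))
difficulty = rng.random(12)
quality = rng.random(12)

cluster_labels = kmeans(sae_features, k=3)
batches = plan_batches(cluster_labels, difficulty, quality, batch_size=4)
```

In this sketch the round-robin step is only one possible reading of "moderate batch mixing"; a real pipeline would replace the random arrays with SAE activations and with scores produced by whatever difficulty proxy and quality probe the paper defines.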


