2605.29303v1 May 28, 2026 cs.AI

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

Zhi Zheng
Tong Xu
Enhong Chen
Qi Liu
Mingdi Sun
Yongyi He
Yi Zheng
Zhefeng Wang

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. The SFT stage provides a cold start for RL exploration, avoiding the inefficiency of pure RL, where on-policy sampling yields too few positive samples. In practice, however, existing approaches often use far less data for SFT initialization than for the RL phase, which can cause the model to fit the limited samples too closely and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to explore effectively during subsequent RL training. To address this challenge, we argue that in low-data regimes SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration in the RL stage. Our code and datasets are available at https://github.com/MINE-USTC/EKSFT.
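The masking rule described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation; the abstract does not specify how the entropy and KL terms are computed or thresholded. The sketch assumes that entropy is taken over the predictive distribution of the model being fine-tuned, that the divergence is KL(policy || reference), and that fixed thresholds decide which tokens are masked. The function names and the parameters `entropy_max` and `kl_max` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against log(0) when q has zeros."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def selective_mask(policy_probs, ref_probs, entropy_max, kl_max):
    """Return 1 for tokens kept in the SFT loss and 0 for masked tokens.

    Assumption: a token is masked when its policy entropy OR its KL
    divergence from the reference model exceeds a fixed threshold.
    The thresholds are hypothetical and not taken from the paper.
    """
    mask = []
    for p, q in zip(policy_probs, ref_probs):
        keep = entropy(p) <= entropy_max and kl_divergence(p, q) <= kl_max
        mask.append(1 if keep else 0)
    return mask


def masked_nll(policy_probs, targets, mask):
    """Mean negative log-likelihood over the unmasked tokens only."""
    kept = [-math.log(p[t]) for p, t, m in zip(policy_probs, targets, mask) if m]
    return sum(kept) / len(kept) if kept else 0.0


# Toy example: three token positions over a four-word vocabulary.
policy = [
    [0.97, 0.01, 0.01, 0.01],  # confident and close to the reference
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain (high entropy)
    [0.90, 0.05, 0.03, 0.02],  # confident but far from the reference
]
reference = [
    [0.95, 0.02, 0.02, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.05, 0.90, 0.03, 0.02],
]
targets = [0, 2, 0]

mask = selective_mask(policy, reference, entropy_max=1.0, kl_max=0.5)
loss = masked_nll(policy, targets, mask)
print(mask, loss)
```

With these toy inputs only the first token contributes to the loss: the second is dropped for high entropy and the third for high KL divergence. In a real training loop the same mask would typically be applied to per-token logits from the two models rather than to hand-written probability lists.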
