2602.03309v1 Feb 03, 2026 cs.LG

Entropy-Gated Selective Policy Optimization: Token-Level Gradient Allocation for Hybrid Training of Large Language Models

Yuelin Hu (Citations: 5, h-index: 2)

Zhengxue Cheng (Citations: 263, h-index: 9)

Li Song (Citations: 71, h-index: 3)

Wei Liu (Citations: 12, h-index: 3)

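Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on the model's own outputs, and this combination is usually done at the sample level. This paper proposes Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, builds a stable initial policy using a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes the predictive entropy of each token. Stage 3, the EGSPO mechanism, performs entropy-gated gradient allocation: a predictive entropy module assigns high-entropy tokens to full PPO updates to encourage exploration, and assigns low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Notably, both branches include the advantage function A_t, so that incorrect trajectories receive a consistent negative learning signal and confident errors are not reinforced. EGSPO shows consistent improvements on mathematical reasoning benchmarks, with gains of 3.8% on AIME and 2.9% on MATH over the CHORD phi baseline, at an additional computational overhead of only 3.4%.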

Original Abstract

Hybrid training methods for large language models combine supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) on model rollouts, typically at the sample level. We propose Entropy-Gated Selective Policy Optimization (EGSPO), a three-stage framework that extends sample-level mixing with token-level gradient modulation. Stage 1, SFT expert learning, establishes a reliable warm-up policy using expert demonstrations with a pure SFT loss. Stage 2, RL rollout generation, samples trajectories from the current policy and computes per-token predictive entropy. Stage 3, the EGSPO mechanism, applies entropy-gated gradient allocation: a predictive entropy module routes high-entropy tokens to full PPO updates to encourage exploration, and low-entropy tokens to attenuated PPO updates to reduce variance and preserve knowledge. Critically, both branches incorporate the advantage function A_t, ensuring that incorrect trajectories receive consistent negative learning signals and preventing reinforcement of confident errors. EGSPO achieves consistent improvements on mathematical reasoning benchmarks, with gains of 3.8 percent on AIME and 2.9 percent on MATH over the CHORD phi baseline, while incurring only 3.4 percent additional computational overhead.
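The Stage 3 routing described in the abstract can be illustrated with a minimal sketch. The abstract does not give the gating rule or the attenuation scheme, so the hard entropy threshold `tau`, the constant attenuation factor `alpha`, the clipping range `clip_eps`, and all function names below are assumptions made for illustration only, not the authors' implementation. The sketch only shows the one property the abstract states explicitly: both branches keep the advantage A_t, so a token in an incorrect trajectory still receives a negative signal even when it is low-entropy.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def ppo_token_loss(ratio, advantage, clip_eps=0.2):
    """Standard PPO clipped surrogate for a single token, written as a loss."""
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return -min(ratio * advantage, clipped * advantage)

def entropy_gated_loss(dists, ratios, advantages, tau=1.0, alpha=0.1):
    """Mean per-token loss with a hypothetical hard entropy gate.

    Tokens with entropy >= tau get the full PPO loss (weight 1.0);
    tokens below tau get the same loss scaled by alpha (attenuated).
    tau and alpha are illustrative values, not taken from the paper.
    """
    total = 0.0
    for probs, ratio, adv in zip(dists, ratios, advantages):
        weight = 1.0 if token_entropy(probs) >= tau else alpha
        total += weight * ppo_token_loss(ratio, adv)
    return total / len(ratios)

# Two tokens from an incorrect trajectory (negative advantage):
# the first is uncertain (uniform over 4 options), the second is confident.
dists = [[0.25, 0.25, 0.25, 0.25], [0.97, 0.01, 0.01, 0.01]]
ratios = [1.0, 1.0]
advantages = [-1.0, -1.0]
loss = entropy_gated_loss(dists, ratios, advantages)
```

In this toy input the uncertain token contributes its full loss and the confident token contributes a tenth of it, but both terms keep the sign set by the advantage, which is the behavior the abstract attributes to the two branches.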

Citations: 3 (Influential: 0), Altmetric: 4.5, Score: 25.5
