2602.03195v2 Feb 03, 2026 cs.LG

Reinforcement Learning with Promising Tokens for Large Language Models

Xiangru Tang

Citations: 102

h-index: 4

Jing-Cheng Pang

Citations: 12

h-index: 2

Li-Chun Lu

Citations: 188

h-index: 4

Kun Jiang

Citations: 3

h-index: 1

Sijie Wu

Citations: 4

h-index: 1

Kai Zhang

Citations: 4

h-index: 1

Xubin Li

Citations: 4

h-index: 1

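Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Existing approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. This formulation, however, places a large mass of contextually irrelevant tokens in the action space, which can keep the policy from focusing on decisions among the genuinely reasonable tokens. This work verifies that valid reasoning paths tend to concentrate in a low-rank subspace and, building on that insight, proposes Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action-space problem by decoupling strategic decision-making from token generation. Specifically, RLPT uses the semantic priors of the base model to identify a dynamic set of promising tokens and restricts policy optimization to that refined subset via masking. Theoretical analysis and experiments show that RLPT reduces gradient variance, stabilizes training, and improves sample efficiency. Reasoning experiments in math, coding, and telecom show that RLPT outperforms standard RL baselines and integrates with different model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).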

Original Abstract

Reinforcement learning (RL) has emerged as a key paradigm for aligning and optimizing large language models (LLMs). Standard approaches treat the LLM as the policy and apply RL directly over the full vocabulary space. However, this formulation includes the massive tail of contextually irrelevant tokens in the action space, which could distract the policy from focusing on decision-making among the truly reasonable tokens. In this work, we verify that valid reasoning paths could inherently concentrate within a low-rank subspace. Based on this insight, we introduce Reinforcement Learning with Promising Tokens (RLPT), a framework that mitigates the action space issue by decoupling strategic decision-making from token generation. Specifically, RLPT leverages the semantic priors of the base model to identify a dynamic set of promising tokens and constrains policy optimization exclusively to this refined subset via masking. Theoretical analysis and empirical results demonstrate that RLPT effectively reduces gradient variance, stabilizes the training process, and improves sample efficiency. Experiment results on math, coding, and telecom reasoning show that RLPT outperforms standard RL baselines and integrates effectively across various model sizes (4B and 8B) and RL algorithms (GRPO and DAPO).
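The masking step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how the promising set is chosen, so the top-k rule below, the function names `promising_token_mask` and `masked_log_softmax`, and the toy logits are all assumptions made for illustration. The sketch only shows the general idea of renormalizing the policy over a subset picked from base-model scores.

```python
import math


def promising_token_mask(base_logits, k):
    """Mark the k highest-scoring tokens under the base model.

    Top-k is an assumed selection rule; the abstract does not specify one.
    """
    if not 1 <= k <= len(base_logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    ranked = sorted(range(len(base_logits)), key=lambda i: base_logits[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(base_logits))]


def masked_log_softmax(policy_logits, mask):
    """Log-softmax over masked-in tokens only; masked-out tokens get -inf."""
    kept = [x for x, m in zip(policy_logits, mask) if m]
    peak = max(kept)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in kept))
    return [x - log_z if m else -math.inf for x, m in zip(policy_logits, mask)]


# Toy 5-token vocabulary (hypothetical numbers).
base_logits = [2.0, 1.5, -3.0, 0.1, -4.0]    # frozen base model scores
policy_logits = [1.0, 2.0, 0.5, -1.0, 0.0]   # scores from the policy being trained

mask = promising_token_mask(base_logits, k=2)
log_probs = masked_log_softmax(policy_logits, mask)
probs = [math.exp(lp) for lp in log_probs]
```

In this toy setup the policy's probability mass is spread only over the two tokens the base model ranks highest, so any policy-gradient term computed from `log_probs` would involve just those tokens. A real training loop would apply the same idea per position on tensors of logits.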

Citations: 1

Influential: 0

Altmetric: 2

Score: 11.0
