2602.04265v1 Feb 04, 2026 cs.LG

Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning

Zhen Yang

Citations: 6,235

h-index: 7

Wenze Lin

Citations: 141

h-index: 3

Xitai Jiang

Citations: 8

h-index: 1

P. Ma

Citations: 13

h-index: 1

Gao Huang

Citations: 133

h-index: 3

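Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for improving the reasoning ability of large language models (LLMs). However, it often runs into entropy collapse, excessive verbosity, and insufficient exploration on hard problems. More importantly, existing reward schemes do not distinguish between the broad search needed while a problem is still being solved and the efficiency expected once the knowledge has been mastered. This work introduces T2T (Thickening-to-Thinning), a dynamic reward framework modeled on human learning. T2T has two phases: (1) on incorrect attempts, it encourages "thickening" (longer trajectories) to widen the search space and explore new solution paths; (2) once the answer is correct, it switches to "thinning", applying length penalties that discourage redundancy, build model confidence, and consolidate reasoning ability. In extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) with Qwen-series and DeepSeek models, T2T significantly outperforms standard GRPO and recent baselines.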

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising paradigm for enhancing reasoning in Large Language Models (LLMs). However, it frequently encounters challenges such as entropy collapse, excessive verbosity, and insufficient exploration for hard problems. Crucially, existing reward schemes fail to distinguish between the need for extensive search during problem-solving and the efficiency required for mastered knowledge. In this work, we introduce T2T (Thickening-to-Thinning), a dynamic reward framework inspired by human learning processes. Specifically, it implements a dual-phase mechanism: (1) On incorrect attempts, T2T incentivizes "thickening" (longer trajectories) to broaden the search space and explore novel solution paths; (2) Upon achieving correctness, it shifts to "thinning", imposing length penalties to discourage redundancy, thereby fostering model confidence and crystallizing reasoning capabilities. Extensive experiments on mathematical benchmarks (MATH-500, AIME, AMC) across Qwen-series and DeepSeek models demonstrate that T2T significantly outperforms standard GRPO and recent baselines.
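
The dual-phase mechanism described in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual reward formula, so everything below is an assumption made for illustration: the linear dependence on length, the names `T2TConfig` and `shaped_reward`, and the parameters `alpha`, `beta`, and `max_len` are all hypothetical and are not taken from the paper. The sketch only captures the stated direction of the incentive: longer trajectories earn more when the answer is wrong, and less when it is right.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class T2TConfig:
    """Hypothetical hyperparameters; none of these come from the paper."""
    alpha: float = 0.2   # assumed maximum bonus for a long incorrect trajectory
    beta: float = 0.2    # assumed maximum penalty for a long correct trajectory
    max_len: int = 4096  # assumed length (in tokens) at which shaping saturates


def shaped_reward(is_correct: bool, length: int, cfg: T2TConfig = T2TConfig()) -> float:
    """Return a length-shaped reward for one sampled trajectory.

    Incorrect answers ("thickening"): reward grows with length, from 0 to alpha.
    Correct answers ("thinning"): reward shrinks with length, from 1 to 1 - beta.
    Choosing alpha + beta < 1 keeps every correct answer above every incorrect one.
    """
    # Clamp the length into [0, max_len] and normalize it to [0, 1].
    frac = min(max(length, 0), cfg.max_len) / cfg.max_len
    if is_correct:
        return 1.0 - cfg.beta * frac
    return cfg.alpha * frac


if __name__ == "__main__":
    for correct, n_tokens in [(False, 512), (False, 4096), (True, 512), (True, 4096)]:
        print(correct, n_tokens, round(shaped_reward(correct, n_tokens), 4))
```

With these assumed defaults, a correct answer never scores below 0.8 and an incorrect one never scores above 0.2, so the length term reorders trajectories only within each correctness class. Whether the paper's formulation has this property is not stated in the abstract.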

1 Citation

0 Influential

3.5 Altmetric

18.5 Score
