2601.04805v1 Jan 08, 2026 cs.AI

Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning

Siyuan Gan (Citations: 0, h-index: 0)
Jiaheng Liu (Citations: 475, h-index: 6)
Boyan Wang (Citations: 42, h-index: 3)
Tianpei Yang (Citations: 92, h-index: 5)
Yuyao Zhang (Citations: 6, h-index: 2)
Fanyu Meng (Citations: 153, h-index: 3)
Junlan Feng (Citations: 198, h-index: 3)
Linjian Meng (Citations: 29, h-index: 2)
Jing Huo (Citations: 1,198, h-index: 17)
Yang Gao (Citations: 59, h-index: 5)
Runqing Miao (Citations: 140, h-index: 7)

Original Abstract

Large reasoning models (LRMs) have attracted much attention due to their exceptional performance. However, their performance mainly stems from thinking, a long Chain of Thought (CoT), which significantly increases computational overhead. To address this overthinking problem, existing work focuses on using reinforcement learning (RL) to train hybrid reasoning models that automatically decide whether to engage in thinking based on the complexity of the query. Unfortunately, using RL suffers from the reward hacking problem, e.g., the model engages in thinking but is judged as not doing so, resulting in incorrect rewards. To mitigate this problem, existing works either employ supervised fine-tuning (SFT), which incurs high computational costs, or enforce uniform token limits on non-thinking responses, which mitigates the problem only to a limited extent. In this paper, we propose Thinking-Based Non-Thinking (TNT). It does not employ SFT, and it sets a different maximum token usage for non-thinking responses across queries by leveraging information from the solution component of the responses that use thinking. Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50% compared to DeepSeek-R1-Distill-Qwen-1.5B/7B and DeepScaleR-1.5B, while significantly improving accuracy. In fact, TNT achieves the best trade-off between accuracy and efficiency among all tested methods. Additionally, the probability of reward hacking in TNT's responses that are classified as not using thinking remains below 10% across all tested datasets.
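
The per-query token limit described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say how the limit is derived from the solution component, so the `</think>` delimiter, the whitespace token count, the use of the longest solution as the cap, and every function name below are assumptions made purely for illustration.

```python
THINK_END = "</think>"  # assumed delimiter between thinking and solution


def solution_part(response: str) -> str:
    """Return the text after the thinking block, or the whole response if none."""
    _, sep, tail = response.partition(THINK_END)
    return tail if sep else response


def count_tokens(text: str) -> int:
    """Whitespace token count; a stand-in for a real tokenizer (assumption)."""
    return len(text.split())


def non_thinking_budget(thinking_responses, default=64):
    """Per-query token cap for non-thinking responses.

    Assumption: the cap is the longest solution component among the
    thinking responses sampled for the same query. The abstract only says
    the cap is derived from the solution component, not how.
    """
    lengths = [count_tokens(solution_part(r)) for r in thinking_responses]
    return max(lengths) if lengths else default


def within_budget(non_thinking_response: str, budget: int) -> bool:
    """True if a response labeled non-thinking fits the per-query cap."""
    return count_tokens(non_thinking_response) <= budget


# Hypothetical thinking responses sampled for one query.
thinking = [
    "<think>Try small cases first, so x = 4.</think>The answer is 4.",
    "<think>Divide both sides by 2.</think>Since 2x = 8, x = 4.",
]
budget = non_thinking_budget(thinking)

short_answer = "x = 4"
hidden_thinking = "Let me reason step by step. " * 5 + "x = 4"
print(budget, within_budget(short_answer, budget), within_budget(hidden_thinking, budget))
```

Under these assumptions, a short direct answer fits the cap, while a response labeled non-thinking that hides a long reasoning trace exceeds it and could be denied the non-thinking reward. A uniform limit, by contrast, would apply the same cap to every query regardless of how long its solution needs to be.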
