2601.08468v1 Jan 13, 2026 cs.CL

JudgeRLVR: Judge First, Generate Second for Efficient Reasoning

Yudong Wang (Citations: 93, h-index: 4)
Hailin Zhang (Citations: 10, h-index: 2)
Jiangshan Duo (Citations: 167, h-index: 3)
Hanyu Li (Citations: 6, h-index: 2)
Sujian Li (Citations: 5, h-index: 2)
Liang Zhao (Citations: 46, h-index: 2)

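Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for reasoning in large language models. However, optimizing only for final-answer correctness can push models toward aimless, verbose exploration, in which they reach solutions through extensive trial and error rather than structured planning. Heuristic constraints such as length penalties can reduce verbosity, but they often cut essential reasoning steps, creating a difficult trade-off between efficiency and verification. This paper argues that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, the model is trained to judge solution responses against verifiable answers. In the second stage, the same model, initialized from the judge, is fine-tuned with vanilla generative RLVR. Compared with vanilla RLVR trained on the same math-domain data, JudgeRLVR gives a better quality-efficiency trade-off on Qwen3-30B-A3B: on in-domain math problems, average accuracy improves by about 3.7 points while average generation length falls by 42%; on out-of-domain benchmarks, average accuracy improves by about 4.5 points, indicating better generalization.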

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a standard paradigm for reasoning in Large Language Models. However, optimizing solely for final-answer correctness often drives models into aimless, verbose exploration, where they rely on exhaustive trial-and-error tactics rather than structured planning to reach solutions. While heuristic constraints like length penalties can reduce verbosity, they often truncate essential reasoning steps, creating a difficult trade-off between efficiency and verification. In this paper, we argue that discriminative capability is a prerequisite for efficient generation: by learning to distinguish valid solutions, a model can internalize a guidance signal that prunes the search space. We propose JudgeRLVR, a two-stage judge-then-generate paradigm. In the first stage, we train the model to judge solution responses with verifiable answers. In the second stage, we fine-tune the same model with vanilla generating RLVR initialized from the judge. Compared to vanilla RLVR using the same math-domain training data, JudgeRLVR achieves a better quality-efficiency trade-off for Qwen3-30B-A3B: on in-domain math, it delivers about +3.7 points average accuracy gain with -42% average generation length; on out-of-domain benchmarks, it delivers about +4.5 points average accuracy improvement, demonstrating enhanced generalization.
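
To make the two stages concrete, here is a minimal sketch of what a verifiable reward might look like in each one. The abstract does not specify the reward functions, so the binary reward shape, the string normalization, and every name below (`JudgeExample`, `normalize`, `judge_reward`, `generation_reward`) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class JudgeExample:
    """Hypothetical container for one judging sample (not from the paper)."""
    problem: str
    candidate_answer: str   # final answer extracted from a candidate solution
    reference_answer: str   # verifiable ground-truth answer


def normalize(answer: str) -> str:
    """Crude normalization (assumption); a real verifier would be stricter."""
    return answer.strip().lower().replace(" ", "")


def judge_reward(verdict: bool, example: JudgeExample) -> float:
    """Stage 1 (judge): reward 1.0 when the model's correct/incorrect verdict
    agrees with the label derived from the verifiable answer."""
    is_correct = normalize(example.candidate_answer) == normalize(example.reference_answer)
    return 1.0 if verdict == is_correct else 0.0


def generation_reward(model_answer: str, reference_answer: str) -> float:
    """Stage 2 (generate): reward 1.0 when the model's own final answer
    matches the reference, as in vanilla RLVR."""
    return 1.0 if normalize(model_answer) == normalize(reference_answer) else 0.0
```

In this sketch, both stages rely on the same answer check; only the object being rewarded changes. Stage 1 rewards a verdict about someone else's solution, while stage 2 rewards the model's own final answer.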

