2604.25419v1 Apr 28, 2026 cs.AI

JURY-RL: Vote-Based Proposal, Proof-Based Evaluation for Label-Free Reinforcement Learning of Language Models

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Dayiheng Liu

Citations: 19,971

h-index: 23

Jing Wu

Citations: 26

h-index: 3

Xinggao Liu

Citations: 397

h-index: 12

Xinjie Chen

Citations: 54

h-index: 3

Biao Fu

Xiamen University

Citations: 186

h-index: 7

Guoxin Chen

Citations: 351

h-index: 7

Minpeng Liao

Citations: 639

h-index: 11

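Reinforcement learning with verifiable rewards (RLVR) improves the reasoning ability of large language models (LLMs), but standard RLVR often relies on human-annotated answers or carefully designed reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge can reduce annotation cost, but they can produce false positives that destabilize training. This paper introduces JURY-RL, a label-free RLVR framework that separates answer proposal from reward assignment. Votes across the model's rollouts propose a candidate answer, and a formal verifier decides whether that candidate may receive positive reward. Concretely, reward is given only to rollouts that match the plurality-voted answer, and only when that answer is successfully verified in Lean. When verification is inconclusive, a fallback reward called ResZero (Residual-Zero) is used. ResZero discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the remaining answers. This design keeps the optimization gradient stable without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. Its pass@1 performance is comparable to supervised ground-truth training, and its higher pass@k and response diversity indicate better generalization.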

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) enhances the reasoning of large language models (LLMs), but standard RLVR often depends on human-annotated answers or carefully curated reward specifications. In machine-checkable domains, label-free alternatives such as majority voting or LLM-as-a-judge remove annotation cost but can introduce false positives that destabilize training. We introduce JURY-RL, a label-free RLVR framework that decouples answer proposal from reward disposal: votes from model rollouts propose a candidate answer, and a formal verifier determines whether that candidate can receive positive reward. Concretely, only rollouts matching the plurality-voted answer are rewarded when that answer is successfully verified in Lean. When verification is inconclusive, we invoke ResZero (Residual-Zero), a fallback reward that discards the unverified plurality proposal and redistributes a zero-mean, variance-preserving signal over the residual answers. This design maintains a stable optimization gradient without reinforcing unverifiable consensus. Across three backbone models trained on mathematical data, JURY-RL consistently outperforms other label-free baselines on mathematical reasoning benchmarks and transfers competitively to code generation and general benchmarks. It attains pass@1 performance comparable to supervised ground-truth training, with superior generalization demonstrated by higher pass@k and response diversity.
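The propose-then-dispose rule in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `verify` is a hypothetical stand-in for a Lean-backed checker, the function names are invented here, and the fallback is only a simple zero-mean redistribution over the residual answers. The abstract does not give the ResZero formula, so the variance-preserving scaling is omitted and the zero reward for plurality rollouts is an assumption.

```python
from collections import Counter
from typing import Callable, List, Optional


def residual_zero_mean(answers: List[str], rejected: str) -> List[float]:
    """Illustrative fallback, NOT the paper's ResZero formula.

    Assumption: rollouts matching the unverified plurality answer get 0, and
    the remaining rollouts get a support-based score centred to zero mean
    over the residual set. No variance-preserving scaling is applied.
    """
    residual = [a for a in answers if a != rejected]
    if not residual:
        return [0.0] * len(answers)
    counts = Counter(residual)
    # Share of residual votes held by each residual answer.
    share = {a: c / len(residual) for a, c in counts.items()}
    mean = sum(share[a] for a in residual) / len(residual)
    return [0.0 if a == rejected else share[a] - mean for a in answers]


def jury_rewards(
    answers: List[str],
    verify: Callable[[str], Optional[bool]],
) -> List[float]:
    """Return one reward per rollout, given each rollout's final answer.

    `verify` (hypothetical) returns True when the candidate is formally
    verified; False or None is treated as inconclusive.
    """
    if not answers:
        return []
    # Votes propose: the plurality answer is the only candidate for reward.
    candidate, _ = Counter(answers).most_common(1)[0]
    if verify(candidate) is True:
        # Proofs dispose: reward only rollouts matching the verified answer.
        return [1.0 if a == candidate else 0.0 for a in answers]
    # Inconclusive verification: do not reinforce the unverified consensus.
    return residual_zero_mean(answers, candidate)


if __name__ == "__main__":
    # Verified case: the plurality answer "4" passes the (toy) verifier.
    print(jury_rewards(["4", "4", "5", "4"], lambda a: a == "4"))
    # Inconclusive case: the verifier returns None for every candidate.
    print(jury_rewards(["7", "7", "7", "3", "3", "9"], lambda a: None))
```

In the inconclusive case the rewards over the residual answers sum to zero, so the unverified plurality answer is never reinforced while the other rollouts still receive a non-degenerate signal.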
