2603.17775v1 Mar 18, 2026 cs.CL

CoVerRL: Breaking the Consensus Trap in Label-Free Reasoning via Generator-Verifier Co-Evolution

Ruiqing Zhang (Citations: 2; h-index: 1)

Yuchen Yan, Zhejiang University (Citations: 385; h-index: 11)

Yongliang Shen (Citations: 265; h-index: 9)

Tengyu Pan (Citations: 72; h-index: 5)

Gaiyang Han (Citations: 0; h-index: 0)

Jun Xiao (Citations: 179; h-index: 7)

Zixuan Wang (Citations: 66; h-index: 2)

Wanqi Zhang (Citations: 14; h-index: 2)

Weiming Lu (Citations: 12; h-index: 2)

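Label-free reinforcement learning lets large language models improve their reasoning without ground-truth supervision, typically by using majority-voted answers as pseudo-labels. This work identifies a critical failure mode: as training maximizes self-consistency, output diversity collapses, so the model confidently reinforces errors that are hard to detect. The authors call this the consensus trap. To address it, they propose CoVerRL, a framework in which a single model alternates between generator and verifier roles, with each capability improving the other. Majority voting provides noisy but useful supervision for training the verifier, and the improving verifier progressively removes self-consistent errors from the pseudo-labels. This co-evolution creates a virtuous cycle that keeps reward accuracy high throughout training. In experiments on the Qwen and Llama model families, CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Self-verification accuracy also rises from about 55% to over 85%, confirming that the two capabilities genuinely co-evolve.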

Original Abstract

Label-free reinforcement learning enables large language models to improve reasoning capabilities without ground-truth supervision, typically by treating majority-voted answers as pseudo-labels. However, we identify a critical failure mode: as training maximizes self-consistency, output diversity collapses, causing the model to confidently reinforce systematic errors that evade detection. We term this the consensus trap. To escape it, we propose CoVerRL, a framework where a single model alternates between generator and verifier roles, with each capability bootstrapping the other. Majority voting provides noisy but informative supervision for training the verifier, while the improving verifier progressively filters self-consistent errors from pseudo-labels. This co-evolution creates a virtuous cycle that maintains high reward accuracy throughout training. Experiments across Qwen and Llama model families demonstrate that CoVerRL outperforms label-free baselines by 4.7-5.9% on mathematical reasoning benchmarks. Moreover, self-verification accuracy improves from around 55% to over 85%, confirming that both capabilities genuinely co-evolve.
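
To make the mechanism in the abstract concrete, here is a minimal Python sketch of majority-vote pseudo-labeling and of a verifier-based filter on top of it. It is an illustration under stated assumptions, not the paper's implementation: the function names, the `verifier` callable (a stand-in for the model acting in its verifier role), and the `threshold` acceptance rule are all hypothetical, since the abstract does not specify how filtering is done.

```python
from collections import Counter
from typing import Callable, List, Optional, Tuple


def majority_vote(answers: List[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of samples that agree with it."""
    label, votes = Counter(answers).most_common(1)[0]
    return label, votes / len(answers)


def pseudo_label_rewards(answers: List[str]) -> List[float]:
    """Plain label-free reward: 1.0 when a sample matches the majority answer."""
    label, _ = majority_vote(answers)
    return [1.0 if a == label else 0.0 for a in answers]


def verified_pseudo_label(
    answers: List[str],
    verifier: Callable[[str], bool],  # hypothetical stand-in for the verifier role
    threshold: float = 0.5,           # assumed acceptance rule, not from the paper
) -> Optional[str]:
    """Keep the majority answer only if the verifier accepts enough of its supporters.

    Returns None when the consensus is rejected, so the question can be
    skipped instead of rewarding a self-consistent error.
    """
    label, _ = majority_vote(answers)
    supporters = [a for a in answers if a == label]
    accept_rate = sum(verifier(a) for a in supporters) / len(supporters)
    return label if accept_rate >= threshold else None


# Four samples agree on "12", one says "7": a strong consensus that may still be wrong.
samples = ["12", "12", "12", "12", "7"]
label, agreement = majority_vote(samples)
rewards = pseudo_label_rewards(samples)
rejected = verified_pseudo_label(samples, verifier=lambda a: a != "12")
accepted = verified_pseudo_label(samples, verifier=lambda a: True)
```

With plain majority voting, the four agreeing samples are rewarded whether or not "12" is correct, which is the situation the abstract calls the consensus trap. The filtered variant withholds the pseudo-label when the verifier rejects the consensus.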
