2601.12186v1 Jan 17, 2026 cs.SE

Aletheia: What Makes RLVR For Code Verifiers Tick?

Iryna Gurevych

Citations: 1,356

h-index: 17

Vatsal Venkatkrishna

Citations: 14

h-index: 2

Indraneil Paul

Citations: 137

h-index: 4

Original Abstract

Multi-domain thinking verifiers trained via Reinforcement Learning from Verifiable Rewards (RLVR) are a prominent fixture of the Large Language Model (LLM) post-training pipeline, owing to their ability to robustly rate and rerank model outputs. However, the adoption of such verifiers towards code generation has been comparatively sparse, with execution feedback constituting the dominant signal. Nonetheless, code verifiers remain valuable toward judging model outputs in scenarios where execution feedback is hard to obtain and are a potentially powerful addition to the code generation post-training toolbox. To this end, we create and open-source Aletheia, a controlled testbed that enables execution-grounded evaluation of code verifiers' robustness across disparate policy models and covariate shifts. We examine components of the RLVR-based verifier training recipe widely credited for its success: (1) intermediate thinking traces, (2) learning from negative samples, and (3) on-policy training. While experiments show the optimality of RLVR, we uncover important opportunities to simplify the recipe. Particularly, despite code verification exhibiting positive training- and inference-time scaling, on-policy learning stands out as the key component at small verifier sizes, and thinking-based training emerges as the most important component at larger scales.
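To make the "rate and rerank" role of a verifier concrete, here is a minimal sketch of best-of-n reranking. The abstract does not say how Aletheia's verifiers are queried, so everything here is an assumption for illustration: `rerank` and `toy_verifier` are hypothetical names, and the toy scorer is a crude syntactic heuristic standing in for a trained RLVR verifier, which would presumably return a learned correctness score without executing the code.

```python
from typing import Callable, List, Tuple


def rerank(
    problem: str,
    candidates: List[str],
    score: Callable[[str, str], float],
) -> List[Tuple[float, str]]:
    """Score every candidate with the verifier and sort highest first.

    Python's sort is stable, so tied candidates keep their input order.
    """
    scored = [(score(problem, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def toy_verifier(problem: str, candidate: str) -> float:
    """Hypothetical stand-in for a trained verifier (not the paper's model).

    It never runs the candidate: it only checks that the text parses as
    Python and whether it contains a return statement.
    """
    try:
        compile(candidate, "<candidate>", "exec")
    except SyntaxError:
        return 0.0
    return 1.0 if "return" in candidate else 0.5


if __name__ == "__main__":
    samples = [
        "def add(a, b): return a + b",
        "def add(a, b) return a + b",  # missing colon, does not parse
        "def add(a, b): pass",
    ]
    for s, code in rerank("add two numbers", samples, toy_verifier):
        print(f"{s:.1f}  {code}")
```

Swapping `toy_verifier` for a call to a real verifier model leaves `rerank` unchanged, which is the sense in which a verifier can stand in for execution feedback when running the code is impractical.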

1 Citation

0 Influential

8.5 Altmetric

43.5 Score
