2601.22803v1 Jan 30, 2026 cs.AI

CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning

Miao Zhang

Weili Guan

Peiming Guo

Meishan Zhang

Xuebo Liu

Min Zhang

Ji Shi

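Code verifiers play an important role in the post-hoc verification of LLM-based code generation, but existing supervised fine-tuning (SFT) methods suffer from data scarcity, high failure rates, and low inference efficiency. Reinforcement learning (RL) is a promising alternative that optimizes the model through execution-based rewards without labeled data, yet preliminary experiments show that naive RL using only functionality rewards fails to generate effective unit tests for difficult branches and samples. The authors first show theoretically that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, and that optimizing these signals improves the reliability of unit-test-based verification. Building on this analysis, they design syntax- and functionality-aware rewards and propose branch- and sample-difficulty-aware RL that uses exponential reward shaping and static analysis metrics. With this approach, CVeDRL achieves state-of-the-art (SOTA) performance with only 0.6B parameters, recording up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, together with more than 20 times faster inference than competitive baselines. The code is available at https://github.com/LIGHTCHASER1/CVeDRL.git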

Original Abstract

Code verifiers play a critical role in post-verification for LLM-based code generation, yet existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. While reinforcement learning (RL) offers a promising alternative by optimizing models through execution-driven rewards without labeled supervision, our preliminary results show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples. We first show theoretically that branch coverage, sample difficulty, and syntactic and functional correctness can be jointly modeled as RL rewards, and that optimizing these signals can improve the reliability of unit-test-based verification. Guided by this analysis, we design syntax- and functionality-aware rewards and further propose branch- and sample-difficulty-aware RL using exponential reward shaping and static analysis metrics. With this formulation, CVeDRL achieves state-of-the-art performance with only 0.6B parameters, yielding up to 28.97% higher pass rate and 15.08% higher branch coverage than GPT-3.5, while delivering over $20\times$ faster inference than competitive baselines. Code is available at https://github.com/LIGHTCHASER1/CVeDRL.git
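
The abstract names the reward ingredients (syntactic correctness, functional correctness, branch coverage, sample difficulty, exponential shaping) but gives no formulas. The Python sketch below is only one plausible way such signals could be combined, not the authors' reward. The function names, the penalty values, the `alpha` sharpness parameter, and the assumption that coverage and difficulty are already normalized to [0, 1] are all hypothetical choices made for illustration.

```python
import ast
import math


def is_valid_python(test_code: str) -> bool:
    """Syntax signal: does the generated unit test parse as Python?"""
    try:
        ast.parse(test_code)
        return True
    except (SyntaxError, ValueError):
        return False


def shaped_reward(
    test_code: str,
    passed: bool,
    branch_coverage: float,
    difficulty: float,
    alpha: float = 2.0,
) -> float:
    """Hypothetical composite reward for one generated unit test.

    Assumptions (not taken from the paper):
      - branch_coverage and difficulty are in [0, 1]
      - unparsable tests get -1.0, failing tests get 0.0
      - a passing test earns 1.0 plus a difficulty-weighted bonus
        that grows exponentially with branch coverage
    """
    if not is_valid_python(test_code):
        return -1.0  # syntax penalty
    if not passed:
        return 0.0  # parses, but fails when executed

    # Exponential shaping, normalized so the bonus term lies in [0, 1].
    bonus = (math.exp(alpha * branch_coverage) - 1.0) / (math.exp(alpha) - 1.0)
    return 1.0 + difficulty * bonus


if __name__ == "__main__":
    test = "assert add(1, 2) == 3"
    for cov in (0.0, 0.5, 1.0):
        print(cov, round(shaped_reward(test, True, cov, difficulty=1.0), 3))
```

With this shaping, the marginal reward for additional coverage is larger at high coverage than at low coverage, and harder samples scale that bonus up. This is one way to push a policy toward hard-to-reach branches. How the paper actually measures difficulty through static analysis is not specified in the abstract.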
