2602.17684v1 Feb 04, 2026 cs.LG

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Execution-Free Reward Models

Zhijiang Guo, HKUST (GZ); Citations: 4,271; h-index: 30

Xinyu Zhou; Citations: 57; h-index: 4

Boyu Zhu; Citations: 61; h-index: 4

Hanxu Hu; Citations: 54; h-index: 3

Mingzhe Du; Citations: 84; h-index: 5

Haotian Zhang; Citations: 103; h-index: 3

Huiming Wang; Citations: 73; h-index: 3

Xiao Zhu; Citations: 27; h-index: 2

Original Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, an execution-free reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across five coding benchmarks, CodeScaler improves Qwen3-8B-Base by an average of +11.72 points, outperforming binary execution-based RL by +1.82 points, and enables scalable reinforcement learning on synthetic datasets without any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
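The abstract names two mechanisms that a short sketch can make concrete: syntax-aware code extraction and execution-free test-time scaling. The Python below is only an illustration under stated assumptions, not CodeScaler's implementation. It assumes candidates arrive as Markdown-style responses with fenced code, that "syntax-aware" can be approximated by keeping only blocks that parse with Python's `ast` module, and that test-time scaling takes the form of best-of-N selection. The names `fenced_blocks`, `extract_code`, `best_of_n`, and `toy_score` are hypothetical, and `toy_score` is a placeholder where a trained reward model would be called.

```python
import ast
from typing import Callable, List, Optional

FENCE = "`" * 3  # Markdown code-fence marker


def fenced_blocks(response: str) -> List[str]:
    """Collect the bodies of all closed fenced blocks in a model response."""
    blocks: List[str] = []
    current: List[str] = []
    inside = False
    for line in response.splitlines():
        if line.strip().startswith(FENCE):
            if inside:
                blocks.append("\n".join(current))
                current = []
            inside = not inside
        elif inside:
            current.append(line)
    return blocks  # an unclosed trailing fence is ignored


def extract_code(response: str) -> Optional[str]:
    """Return the last non-empty fenced block that parses as Python, or None."""
    for block in reversed(fenced_blocks(response)):
        if not block.strip():
            continue
        try:
            ast.parse(block)  # syntax check only; nothing is executed
        except SyntaxError:
            continue
        return block.strip()
    return None


def best_of_n(
    problem: str,
    responses: List[str],
    score: Callable[[str, str], float],
) -> Optional[str]:
    """Pick the highest-scoring syntactically valid candidate without running it."""
    best_code: Optional[str] = None
    best_score = float("-inf")
    for response in responses:
        code = extract_code(response)
        if code is None:
            continue  # unparseable candidates never reach the scorer
        candidate_score = score(problem, code)
        if best_code is None or candidate_score > best_score:
            best_code, best_score = code, candidate_score
    return best_code


def toy_score(problem: str, code: str) -> float:
    """Hypothetical stand-in scorer (prefers shorter code); NOT a reward model."""
    return -float(len(code))
```

Calling `best_of_n(problem, responses, toy_score)` returns the selected program text, or `None` when no candidate parses. Because selection needs only one scorer call per candidate and no sandboxed test execution, a setup like this is where the latency savings described in the abstract would come from.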

2 Citations

0 Influential

15 Altmetric

77.0 Score
