2603.24202v1 Mar 25, 2026 cs.LG

A Deep Dive into Scaling RL for Code Generation with Synthetic Data and Curricula

Gabriel Synnaeve (Citations: 59,369; h-index: 57)

Cansu Sancaktar (Citations: 156; h-index: 5)

David Zhang (Citations: 58; h-index: 2)

Taco Cohen (Citations: 247; h-index: 6)

Original Abstract

Reinforcement learning (RL) has emerged as a powerful paradigm for improving large language models beyond supervised fine-tuning, yet sustaining performance gains at scale remains an open challenge, as data diversity and structure, rather than volume alone, become the limiting factor. We address this by introducing a scalable multi-turn synthetic data generation pipeline in which a teacher model iteratively refines problems based on in-context student performance summaries, producing structured difficulty progressions without any teacher fine-tuning. Compared to single-turn generation, this multi-turn approach substantially improves the yield of valid synthetic problems and naturally produces stepping stones, i.e., easier and harder variants of the same core task, that support curriculum-based training. We systematically study how task difficulty, curriculum scheduling, and environment diversity interact during RL training across the Llama3.1-8B Instruct and Qwen3-8B Base model families, with additional scaling experiments on Qwen2.5-32B. Our results show that synthetic augmentation consistently improves in-domain code performance and, in most cases, out-of-domain math performance, and we provide empirical insights into how curriculum design and data diversity jointly shape RL training dynamics.
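The multi-turn loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `Problem`, `Attempted`, `summarize`, `multi_turn_generate`, `curriculum`, and the toy teacher and student are all hypothetical names introduced here, and the scalar `difficulty` field is an assumed stand-in for what would in practice be an LLM-written task checked by unit tests. The sketch only shows the control flow: the student is evaluated, its pass rate is summarized as text, and a frozen teacher proposes the next variant from that summary.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    statement: str
    difficulty: int  # hypothetical scalar stand-in for how hard the task is


@dataclass
class Attempted:
    problem: Problem
    pass_rate: float  # fraction of student attempts that passed


def summarize(problem: Problem, pass_rate: float) -> str:
    """Build the in-context summary the teacher sees on the next turn."""
    return f"problem={problem.statement!r} student_pass_rate={pass_rate:.2f}"


def multi_turn_generate(
    seed: Problem,
    teacher: Callable[[Problem, str], Problem],
    student_pass_rate: Callable[[Problem], float],
    turns: int,
) -> List[Attempted]:
    """Run `turns` rounds of evaluate -> summarize -> refine."""
    history: List[Attempted] = []
    current = seed
    for _ in range(turns):
        rate = student_pass_rate(current)
        history.append(Attempted(current, rate))
        # The teacher is only prompted with the summary; it is never fine-tuned.
        current = teacher(current, summarize(current, rate))
    return history


def curriculum(history: List[Attempted]) -> List[Attempted]:
    """Order variants from easiest to hardest by observed pass rate."""
    return sorted(history, key=lambda a: a.pass_rate, reverse=True)


# Toy stand-ins (assumptions): real use would call LLMs and execute tests.
def toy_student(problem: Problem) -> float:
    return max(0.0, min(1.0, 1.0 - 0.2 * problem.difficulty))


def toy_teacher(problem: Problem, summary: str) -> Problem:
    rate = float(summary.rsplit("=", 1)[1])
    step = 1 if rate > 0.5 else -1  # harder if the student is doing well
    level = max(0, problem.difficulty + step)
    return Problem(f"variant at level {level}", level)


if __name__ == "__main__":
    run = multi_turn_generate(Problem("seed", 1), toy_teacher, toy_student, 4)
    for item in curriculum(run):
        print(item.problem.statement, round(item.pass_rate, 2))
```

Because every variant descends from the same seed, the collected history contains both easier and harder versions of one core task, which is the sense in which such a loop could yield stepping stones for a curriculum. Sorting by pass rate is one simple scheduling choice assumed here; the paper studies curriculum scheduling empirically and may use a different one.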

1 Citation

0 Influential

28.5 Altmetric

143.5 Score
