2601.04809v2 Jan 08, 2026 cs.AI

SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning

Changyi Xiao (Citations: 33, h-index: 3)
Yixin Cao (Citations: 46, h-index: 4)
Caijun Xu (Citations: 1, h-index: 1)
Zhongyuan Peng (Citations: 8, h-index: 1)
Xinrun Wang (Citations: 119, h-index: 5)

Abstract

Reinforcement learning (RL) offers a principled way to enhance the reasoning capabilities of large language models, yet its effectiveness hinges on training signals that remain informative as models evolve. In practice, RL progress often slows when task difficulty becomes poorly aligned with model capability, or when training is dominated by a narrow set of recurring problem patterns. To jointly address these issues, we propose SCALER (Synthetic sCalable Adaptive Learning Environment for Reasoning), a framework that sustains effective learning signals through adaptive environment design. SCALER introduces a scalable synthesis pipeline that converts real-world programming problems into verifiable reasoning environments with controllable difficulty and unbounded instance generation, enabling RL training beyond finite datasets while preserving strong correctness guarantees. Building on this, SCALER further employs an adaptive multi-environment RL strategy that dynamically adjusts instance difficulty and curates the active set of environments to track the model's capability frontier and maintain distributional diversity. This co-adaptation prevents reward sparsity, mitigates overfitting to narrow task patterns, and supports sustained improvement throughout training. Extensive experiments show that SCALER consistently outperforms dataset-based RL baselines across diverse reasoning benchmarks and exhibits more stable, long-horizon training dynamics.
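The abstract describes adjusting instance difficulty so that it tracks the model's capability frontier, but gives no update rule. As a rough illustration only, the sketch below shows one simple way such a controller could work: raise the difficulty when the recent success rate is high, lower it when it is low. The class name DifficultyController, the integer level range, and the 0.3 and 0.7 thresholds are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class DifficultyController:
    """Hypothetical per-environment difficulty tracker (not the paper's method)."""
    level: int = 1
    min_level: int = 1
    max_level: int = 10
    low: float = 0.3   # assumed lower edge of the target success band
    high: float = 0.7  # assumed upper edge of the target success band

    def update(self, successes: int, attempts: int) -> int:
        """Move the level one step toward the band where rewards stay informative."""
        if attempts <= 0:
            # No rollouts were collected, so there is nothing to adapt to.
            return self.level
        rate = successes / attempts
        if rate > self.high:
            # Instances are too easy: step the difficulty up, capped at max_level.
            self.level = min(self.max_level, self.level + 1)
        elif rate < self.low:
            # Instances are too hard and rewards are sparse: step down, floored at min_level.
            self.level = max(self.min_level, self.level - 1)
        return self.level


if __name__ == "__main__":
    ctrl = DifficultyController()
    print(ctrl.update(successes=9, attempts=10))  # success rate 0.9 is above the band
    print(ctrl.update(successes=1, attempts=10))  # success rate 0.1 is below the band
```

Keeping the success rate inside a middle band is one common way to avoid the all-fail or all-pass batches that give an RL update no signal; curating the active set of environments, which the abstract also mentions, would be a separate mechanism that this sketch does not cover.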

