2604.18381v1 Apr 20, 2026 cs.AI

Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes

Thomas Walshe (Citations: 74; h-index: 2)
Derek Pham (Citations: 50; h-index: 4)
Armin Parchami (Citations: 154; h-index: 6)
P. Varma (Citations: 2,664; h-index: 15)
Justin Bauer (Citations: 8; h-index: 2)
Harit Vishwakarma (Citations: 25; h-index: 3)
Frederic Sala (Citations: 24; h-index: 2)

Abstract

Fine-tuning Large Language Models (LLMs) typically relies on large quantities of high-quality annotated data, or questions with well-defined ground truth answers in the case of Reinforcement Learning with Verifiable Rewards (RLVR). While previous work has explored the benefits to model reasoning capabilities by scaling both data and compute used for RLVR, these results lack applicability in many real-world settings where annotated data and accessible compute may be scarce. In this work, we present a comprehensive empirical study of open-source Small Language Model (SLM) performance after RLVR in low data regimes. Across three novel datasets covering number counting problems, graph reasoning, and spatial reasoning, we characterize how model performance scales with dataset size, diversity, and complexity. We demonstrate that (1) procedural datasets allow for fine-grained evaluation and training dataset development with controllable properties (size, diversity, and complexity), (2) under RLVR, models trained on lower complexity tasks can generalize to higher complexity tasks, and (3) training on mixed complexity datasets is associated with the greatest benefits in low data regimes, providing up to 5x sample efficiency versus training on easy tasks. These findings inspire future work on the development of data scaling laws for RLVR and the use of procedural data generators to further understand effective data development for efficient LLM fine-tuning.
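
To make the idea of a procedural dataset with controllable properties more concrete, here is a minimal Python sketch of a number counting generator paired with an exact-match verifiable reward. It is an illustration only, not the authors' generator: the names `CountingExample`, `make_example`, `make_dataset`, and `verifiable_reward` are hypothetical, and treating sequence length as the "complexity" knob is an assumption made for this example.

```python
import random
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class CountingExample:
    prompt: str
    answer: int
    complexity: int  # assumption: complexity = length of the digit sequence


def make_example(rng: random.Random, complexity: int) -> CountingExample:
    """Generate one counting question whose answer is known by construction."""
    digits = [rng.randint(0, 9) for _ in range(complexity)]
    target = rng.randint(0, 9)
    sequence = " ".join(str(d) for d in digits)
    prompt = f"How many times does {target} appear in the sequence {sequence}?"
    return CountingExample(prompt, digits.count(target), complexity)


def make_dataset(size: int, complexities: list[int], seed: int = 0) -> list[CountingExample]:
    """Build a dataset of a given size; passing several complexity levels
    yields a mixed-complexity dataset, passing one level yields a uniform one."""
    rng = random.Random(seed)
    return [make_example(rng, rng.choice(complexities)) for _ in range(size)]


def verifiable_reward(completion: str, example: CountingExample) -> float:
    """Return 1.0 if the last integer in the completion equals the ground
    truth, else 0.0 (a simple exact-match check, assumed for illustration)."""
    numbers = re.findall(r"-?\d+", completion)
    return 1.0 if numbers and int(numbers[-1]) == example.answer else 0.0


# Usage: a small mixed-complexity dataset (short and long sequences).
mixed = make_dataset(size=8, complexities=[5, 20], seed=1)
```

Because every answer is computed when the question is generated, dataset size, the mix of complexity levels, and the random seed can each be varied independently, which is the kind of control the abstract attributes to procedural data.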

Citations: 0; Influential: 0; Altmetric: 7.5; Score: 37.5
