2601.22975v2 Jan 30, 2026 cs.AI

Golden Goose: A Simple Trick to Synthesize Unlimited RLVR Tasks from Unverifiable Internet Text

Ximing Lu, University of Washington (Citations: 6,237; h-index: 34)

David Acuna (Citations: 151; h-index: 4)

Jaehun Jung (Citations: 243; h-index: 8)

Jian Hu (Citations: 159; h-index: 4)

Shizhe Diao (Citations: 986; h-index: 13)

Shaokun Zhang (Citations: 25; h-index: 2)

Mingjie Liu (Citations: 252; h-index: 7)

Hyunwoo Kim (Citations: 32; h-index: 2)

Prithviraj Ammanabrolu (Citations: 401; h-index: 6)

Jan Kautz (Citations: 37; h-index: 2)

Yejin Choi (Citations: 483; h-index: 10)

Di Zhang (Citations: 24; h-index: 2)

Yi Dong (Citations: 267; h-index: 7)

Yunheng Zou (Citations: 46; h-index: 4)

Brandon Cui (Citations: 84; h-index: 2)

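Reinforcement learning with verifiable rewards (RLVR) has become a core technique for improving complex reasoning in large language models (LLMs). However, because existing verifiable data is limited, gains from RL become increasingly hard to obtain as training runs longer. To address this, we propose Golden Goose, a simple method for synthesizing unlimited RLVR tasks from unverifiable internet text. The method converts a fill-in-the-middle task over the text into multiple-choice question-answering form. Given a source text, an LLM identifies and masks key reasoning steps, then generates a set of diverse, plausible distractors. This lets us use reasoning-rich unverifiable data (e.g., science textbooks) that has typically been excluded from RLVR data construction to build GooseReason-0.7M, a large-scale RLVR dataset of more than 0.7 million tasks. In experiments, GooseReason effectively revived models whose performance had plateaued on existing RLVR data, delivered steady gains under continued RL, and achieved new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we applied Golden Goose in a real-world setting, synthesizing RLVR tasks from raw web data in the cybersecurity domain. Training Qwen3-4B-Instruct on the resulting GooseReason-Cyber dataset set a new state of the art in cybersecurity, surpassing a 7B model that had undergone extensive domain-specific pre-training and post-training. This demonstrates the potential of automatically scaling RLVR data using reasoning-rich, unverifiable internet text.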

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a cornerstone for unlocking complex reasoning in Large Language Models (LLMs). Yet, scaling up RL is bottlenecked by limited existing verifiable data, where improvements increasingly saturate over prolonged training. To overcome this, we propose Golden Goose, a simple trick to synthesize unlimited RLVR tasks from unverifiable internet text by constructing a multiple-choice question-answering version of the fill-in-the-middle task. Given a source text, we prompt an LLM to identify and mask key reasoning steps, then generate a set of diverse, plausible distractors. This enables us to leverage reasoning-rich unverifiable corpora typically excluded from prior RLVR data construction (e.g., science textbooks) to synthesize GooseReason-0.7M, a large-scale RLVR dataset with over 0.7 million tasks spanning mathematics, programming, and general scientific domains. Empirically, GooseReason effectively revives models saturated on existing RLVR data, yielding robust, sustained gains under continuous RL and achieving new state-of-the-art results for 1.5B and 4B-Instruct models across 15 diverse benchmarks. Finally, we deploy Golden Goose in a real-world setting, synthesizing RLVR tasks from raw FineWeb scrapes for the cybersecurity domain, where no prior RLVR data exists. Training Qwen3-4B-Instruct on the resulting data GooseReason-Cyber sets a new state-of-the-art in cybersecurity, surpassing a 7B domain-specialized model with extensive domain-specific pre-training and post-training. This highlights the potential of automatically scaling up RLVR data by exploiting abundant, reasoning-rich, unverifiable internet text.
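
The task construction described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `build_task`, `reward`, `MCQTask`, and `MASK`, the prompt layout, and the `Answer: <letter>` response format are all assumptions made for illustration. In the paper an LLM picks the span to mask and writes the distractors; here both are passed in by hand so the example stays self-contained.

```python
import random
import re
import string
from dataclasses import dataclass

MASK = "[MASK]"  # hypothetical placeholder token, not taken from the paper


@dataclass
class MCQTask:
    prompt: str         # masked passage followed by lettered options
    options: list[str]  # shuffled options, index 0 corresponds to "A"
    answer: str         # letter of the option holding the original span


def build_task(text: str, span: str, distractors: list[str], seed: int = 0) -> MCQTask:
    """Turn a fill-in-the-middle instance into a multiple-choice question.

    `span` stands in for the key reasoning step an LLM would select, and
    `distractors` for the plausible wrong options an LLM would generate.
    """
    if text.count(span) != 1:
        raise ValueError("span must occur exactly once in text")
    options = [span, *distractors]
    if len(set(options)) != len(options):
        raise ValueError("options must be distinct")
    if len(options) > len(string.ascii_uppercase):
        raise ValueError("too many options to label with single letters")

    # Shuffle so the correct option does not always sit in the same position.
    random.Random(seed).shuffle(options)
    labels = string.ascii_uppercase[: len(options)]
    answer = labels[options.index(span)]

    lines = [text.replace(span, MASK), "", f"Which option fills in {MASK}?"]
    lines += [f"{label}. {option}" for label, option in zip(labels, options)]
    lines.append("Finish with 'Answer: <letter>'.")
    return MCQTask(prompt="\n".join(lines), options=options, answer=answer)


def reward(task: MCQTask, response: str) -> float:
    """Verifiable reward: 1.0 if the last stated answer letter is correct, else 0.0."""
    found = re.findall(r"Answer:\s*\(?([A-Z])\)?", response)
    return 1.0 if found and found[-1] == task.answer else 0.0
```

Because the reward reduces to comparing a single letter against a stored key, it can be checked automatically even though the source passage itself has no verifiable ground truth, which is the property the abstract relies on.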

4 Citations

0 Influential

17 Altmetric

89.0 Score
