2505.03335 May 06, 2025 cs.AI

Absolute Zero: Reinforced Self-play Reasoning with Zero Data

Andrew Zhao

Tsinghua University

Citations: 2,473

h-index: 12

Yiran Wu

Citations: 3,603

h-index: 10

Yang Yue

Citations: 2,049

h-index: 13

Tong Wu

Citations: 416

h-index: 6

Quentin Xu

Citations: 882

h-index: 3

Matthieu Lin

Citations: 1,183

h-index: 10

Shenzhi Wang

Citations: 1,153

h-index: 12

Qingyun Wu

Citations: 480

h-index: 5

Zilong Zheng

UCLA

Citations: 1,650

h-index: 17

Gao Huang

Citations: 292

h-index: 4

Original Abstract

Reinforcement learning with verifiable rewards (RLVR) has shown promise in enhancing the reasoning capabilities of large language models by learning directly from outcome-based rewards. Recent RLVR works that operate under the zero setting avoid supervision in labeling the reasoning process, but still depend on manually curated collections of questions and answers for training. The scarcity of high-quality, human-produced examples raises concerns about the long-term scalability of relying on human supervision, a challenge already evident in the domain of language model pretraining. Furthermore, in a hypothetical future where AI surpasses human intelligence, tasks provided by humans may offer limited learning potential for a superintelligent system. To address these concerns, we propose a new RLVR paradigm called Absolute Zero, in which a single model learns to propose tasks that maximize its own learning progress and improves reasoning by solving them, without relying on any external data. Under this paradigm, we introduce the Absolute Zero Reasoner (AZR), a system that self-evolves its training curriculum and reasoning ability by using a code executor to both validate proposed code reasoning tasks and verify answers, serving as a unified source of verifiable reward to guide open-ended yet grounded learning. Despite being trained entirely without external data, AZR achieves overall SOTA performance on coding and mathematical reasoning tasks, outperforming existing zero-setting models that rely on tens of thousands of in-domain human-curated examples. Furthermore, we demonstrate that AZR can be effectively applied across different model scales and is compatible with various model classes.

258 Citations

20 Influential

8.5 Altmetric

340.5 Score

AI Analysis

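Summary

This paper proposes Absolute Zero, a new reinforcement learning (RL) paradigm for improving the reasoning ability of large language models (LLMs) without relying on any human-generated data or prompts. The authors develop the Absolute Zero Reasoner (AZR), a self-play system in which the model itself generates problems suited to its current learning level (as Proposer) and then solves them (as Solver). AZR uses a Python code executor as its environment to learn three reasoning modes (deduction, induction, and abduction) and receives a verifiable reward from the execution results. In the experiments, AZR, trained without external data, outperformed existing models trained on tens of thousands of human-curated examples on coding and mathematical reasoning benchmarks.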
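The three reasoning modes can be illustrated with a small sketch. The summary does not spell out the task definitions, so the reading used here is an assumption: deduction predicts a program's output from its input, abduction finds an input that yields a given output, and induction writes a program that matches input/output examples. The helper names (`run_program`, `deduction_reward`, and so on) and the convention that every program defines a function `f` are hypothetical, and the bare `exec` call stands in for a properly sandboxed executor.

```python
from typing import Any, Iterable, Tuple


def run_program(source: str, arg: Any) -> Any:
    """Execute `source`, which is assumed to define f(x), and return f(arg).

    Illustration only: a real system would run this in a sandbox.
    """
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f"](arg)


def _matches(source: str, arg: Any, expected: Any) -> bool:
    """True if the program runs without error and returns `expected`."""
    try:
        return run_program(source, arg) == expected
    except Exception:
        return False


def deduction_reward(program: str, arg: Any, predicted_output: Any) -> float:
    """Deduction: the solver predicts the output of program(arg)."""
    return float(_matches(program, arg, predicted_output))


def abduction_reward(program: str, predicted_input: Any, target_output: Any) -> float:
    """Abduction: the solver proposes an input that yields target_output.

    Any input that reproduces the output is accepted.
    """
    return float(_matches(program, predicted_input, target_output))


def induction_reward(candidate: str, examples: Iterable[Tuple[Any, Any]]) -> float:
    """Induction: the solver writes a program consistent with all examples."""
    return float(all(_matches(candidate, x, y) for x, y in examples))
```

For example, with `program = "def f(x):\n    return sorted(x)[::-1]"`, the call `deduction_reward(program, [3, 1, 2], [3, 2, 1])` returns `1.0`, and `abduction_reward(program, [2, 3, 1], [3, 2, 1])` also returns `1.0`, because the reward depends only on what the executor returns.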

Key Innovations

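The Absolute Zero paradigm, which trains with no external data (neither prompts nor answers)
Self-play training driven by verifiable rewards from a code executor
A curriculum that integrates three core reasoning modes (deduction, induction, abduction)
A proposer reward function designed to favor problems that maximize learnability
Task-Relative REINFORCE++, an algorithm tailored to the multitask setting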
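The last two items can be sketched in a few lines. The summary gives no formulas, so both functions below are assumptions about the general idea rather than the authors' implementation: the proposer reward is taken to be zero for tasks the solver never solves and `1 - (mean solve rate)` otherwise, and the task-relative part is reduced to normalizing each reward against the mean and standard deviation of its own (task type, role) group. The names `learnability_reward` and `task_relative_advantages` are hypothetical, and no other part of REINFORCE++ is modeled.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import List, Tuple


def learnability_reward(solve_outcomes: List[float]) -> float:
    """Proposer reward from the solver's outcomes (1 = solved, 0 = failed).

    Assumed form: unsolvable tasks earn 0, and always-solved tasks also
    earn 0 because 1 - 1 = 0, so tasks that are solved only sometimes
    are the ones that earn a positive reward.
    """
    avg = mean(solve_outcomes)
    if avg == 0:
        return 0.0
    return 1.0 - avg


def task_relative_advantages(
    samples: List[Tuple[str, str, float]], eps: float = 1e-6
) -> List[float]:
    """Normalize each reward within its own (task_type, role) group.

    `samples` holds (task_type, role, reward) triples; `eps` avoids
    division by zero when a group has no reward variance.
    """
    groups = defaultdict(list)
    for task_type, role, reward in samples:
        groups[(task_type, role)].append(reward)
    stats = {key: (mean(vals), pstdev(vals)) for key, vals in groups.items()}
    return [
        (reward - stats[(task_type, role)][0]) / (stats[(task_type, role)][1] + eps)
        for task_type, role, reward in samples
    ]
```

With this form, `learnability_reward([1, 0, 0, 0])` returns `0.75`, while `[0, 0]` and `[1, 1]` both return `0.0`. Keeping a separate baseline per group means an easy task type does not inflate the advantages computed for a hard one.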

Learning & Inference Impact

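On the training side, the approach removes the bottleneck of acquiring high-quality human-made datasets, so scaling is no longer bounded by the supply of such data. Because the model generates problems while adjusting their difficulty itself, it acquires generalized reasoning ability rather than rote memorization. In particular, training in the coding domain transferred to mathematical reasoning, demonstrating strong generality. On the reasoning side, "System 2" behaviors emerged naturally: when solving complex problems, the model uses code comments to plan its thinking process and corrects itself through trial and error.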

Technical Difficulty

Advanced

Estimated implementation complexity based on methodology.
