2602.00173v1 Jan 30, 2026 cs.LG

Learning Robust Reasoning through Guided Adversarial Self-Play

Liu Leqi

Citations: 540

h-index: 12

Lizhang Chen

Citations: 37

h-index: 2

Shuozhe Li

Citations: 24

h-index: 3

Vaishnav Tadiparthi

Citations: 123

h-index: 7

Kwonjoon Lee

Citations: 28

h-index: 2

Nakul Agarwal

Citations: 489

h-index: 12

Hossein Nourkhiz Mahjoub

Citations: 502

h-index: 12

Amy Zhang

Citations: 11

h-index: 2

Ehsan Moradi-Pari

Citations: 512

h-index: 12

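Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, but those models can fail badly when the conditioning context is unreliable (for example, a corrupted chain of thought, a misleading partial solution, or a mild input perturbation). The cause is that standard RLVR optimizes final-answer correctness only under clean conditioning. This paper presents GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains error detection and repair using only outcome verification. GASP sets up an adversarial self-play game inside a single model, with no human labels or external teacher. In one role the model acts as a "polluter" that learns to induce failure through locally coherent corruptions; in the other it acts as an "agent" that learns to diagnose and recover from errors under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose "in-distribution repair guidance", an imitation term on self-generated repairs that raises the probability of recovery while preserving previously acquired capabilities. On four open-weight models ranging from 1.5B to 8B parameters, GASP turns strong but brittle reasoners into robust ones that withstand misleading or perturbed context, often while also improving accuracy under clean conditions. Further analysis shows that adversarial corruptions induce an effective training curriculum, and that in-distribution guidance enables fast recovery learning with minimal representational drift.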

Original Abstract

Reinforcement learning from verifiable rewards (RLVR) produces strong reasoning models, yet they can fail catastrophically when the conditioning context is fallible (e.g., corrupted chain-of-thought, misleading partial solutions, or mild input perturbations), since standard RLVR optimizes final-answer correctness only under clean conditioning. We introduce GASP (Guided Adversarial Self-Play), a robustification method that explicitly trains detect-and-repair capabilities using only outcome verification. Without human labels or external teachers, GASP forms an adversarial self-play game within a single model: a polluter learns to induce failure via locally coherent corruptions, while an agent learns to diagnose and recover under the same corrupted conditioning. To address the scarcity of successful recoveries early in training, we propose in-distribution repair guidance, an imitation term on self-generated repairs that increases recovery probability while preserving previously acquired capabilities. Across four open-weight models (1.5B to 8B), GASP transforms strong-but-brittle reasoners into robust ones that withstand misleading and perturbed context while often improving clean accuracy. Further analysis shows that adversarial corruptions induce an effective curriculum, and in-distribution guidance enables rapid recovery learning with minimal representational drift.
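
The abstract does not spell out the training objective, so the following is only a rough sketch of how the pieces it names could fit together, not the authors' implementation. It assumes a zero-sum reward split between the two roles, a REINFORCE-style surrogate with a mean baseline for the agent, and an imitation term taken as the negative log-likelihood of the agent's own successful recoveries. The names `Episode`, `self_play_rewards`, `guided_agent_loss`, and `guidance_weight` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Episode:
    """One self-play round under a corrupted context (hypothetical container)."""
    agent_answer: str       # final answer the agent produced
    reference_answer: str   # ground truth used by the outcome verifier
    agent_logprob: float    # log-probability of the agent's continuation


def outcome_reward(ep: Episode) -> float:
    """Outcome verification only: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if ep.agent_answer.strip() == ep.reference_answer.strip() else 0.0


def self_play_rewards(ep: Episode) -> Tuple[float, float]:
    """Assumed zero-sum split: the agent is paid for recovering,
    the polluter for inducing failure."""
    r_agent = outcome_reward(ep)
    return r_agent, 1.0 - r_agent


def guided_agent_loss(episodes: List[Episode], guidance_weight: float = 0.5) -> float:
    """Policy-gradient surrogate plus an imitation term on successful repairs.

    Assumption: 'self-generated repairs' are approximated here by the agent's
    own rollouts that passed verification.
    """
    rewards = [outcome_reward(ep) for ep in episodes]
    baseline = sum(rewards) / len(rewards)

    # REINFORCE-style surrogate with a mean-reward baseline.
    pg = -sum((r - baseline) * ep.agent_logprob
              for r, ep in zip(rewards, episodes)) / len(episodes)

    # Imitation term: average negative log-likelihood of verified repairs.
    repairs = [ep for r, ep in zip(rewards, episodes) if r == 1.0]
    imitation = (-sum(ep.agent_logprob for ep in repairs) / len(repairs)
                 if repairs else 0.0)

    return pg + guidance_weight * imitation


batch = [Episode("42", "42", -1.0), Episode("7", "42", -2.0)]
loss = guided_agent_loss(batch)
```

In this toy form, the imitation term gives a learning signal whenever at least one recovery succeeds, even though successes are rare early on. That is one plausible reading of why the abstract describes the guidance as addressing the scarcity of successful recoveries.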
