2602.05472v1 Feb 05, 2026 cs.AI

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

Jing Ye (Citations: 0, h-index: 0)

Yiwen Duan (Citations: 3, h-index: 1)

Xinpei Zhao (Citations: 25, h-index: 2)

Abstract

The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent "reward bottleneck": traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the "reasoning trinity" of posing, solving, and judging fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.

