2604.17696v1 Apr 20, 2026 cs.AI

Stratagem: Learning Transferable Reasoning via Trajectory-Modulated Game Self-Play

Yilei Jiang (Citations: 215; h-index: 9)
Lei Huang (Citations: 3,246; h-index: 10)
Weitao Ma (Citations: 3,228; h-index: 10)
Xiaocheng Feng (Citations: 10,669; h-index: 31)
Bing Qin (Citations: 3,455; h-index: 12)
Xiachong Feng (Citations: 837; h-index: 15)
Yuxuan Gu (Citations: 31; h-index: 3)
Qiming Li (Citations: 72; h-index: 3)
Lingpeng Kong (Citations: 351; h-index: 7)
Deyi Yin (Citations: 0; h-index: 0)
Libo Qin (Citations: 39; h-index: 3)
Yangfan Ye (Citations: 83; h-index: 5)

Abstract

Games offer a compelling paradigm for developing general reasoning capabilities in language models, as they naturally demand strategic planning, probabilistic inference, and adaptive decision-making. However, existing self-play approaches rely solely on terminal game outcomes, providing no mechanism to distinguish transferable reasoning patterns from game-specific heuristics. We present STRATAGEM, which addresses two fundamental barriers to reasoning transfer: domain specificity, where learned patterns remain anchored in game semantics, and contextual stasis, where static game contexts fail to cultivate progressive reasoning. STRATAGEM selectively reinforces trajectories exhibiting abstract, domain-agnostic reasoning through a Reasoning Transferability Coefficient, while incentivizing adaptive reasoning development via a Reasoning Evolution Reward. Experiments across mathematical reasoning, general reasoning, and code generation benchmarks demonstrate substantial improvements, with particularly strong gains on competition-level mathematics where multi-step reasoning is critical. Ablation studies and human evaluation confirm that both components contribute to transferable reasoning.
