2602.05717v1 Feb 05, 2026 cs.AI

Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification

Yixia Li

Southern University of Science and Technology

Citations: 141

h-index: 6

Guanhua Chen

Citations: 49

h-index: 1

Yong Wang

Citations: 140

h-index: 2

Tianyi Wang

Citations: 50

h-index: 4

Long Li

Citations: 52

h-index: 4

Hongcan Guo

Citations: 86

h-index: 4

Yibiao Chen

Citations: 20

h-index: 3

Yun Chen

Citations: 7

h-index: 1

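Reinforcement learning with verifiable rewards (RLVR) is increasingly regarded as a tree-pruning mechanism. This work, however, identifies a structural pathology called Recursive Space Contraction (RSC): an irreversible collapse, driven by the combined dynamics of positive sharpening and negative squeezing, in which the sampling probability of valid alternatives vanishes. Kullback-Leibler (KL) regularization attempts to mitigate this, but it imposes a rigid Shape Matching constraint that forces the policy to mimic the full density of the reference model, which conflicts at the gradient level with the sharpening that correctness requires. We therefore propose Anchored Policy Optimization (APO), which shifts the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We show theoretically that APO acts as a gradient-aligned mechanism that maximizes support coverage, enabling an Elastic Recovery that re-expands valid branches. Empirical evaluation on mathematical benchmarks shows that APO breaks the accuracy-diversity trade-off: it substantially improves Pass@1 while restoring the Pass@K diversity that standard policy gradient methods typically lose.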
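The Pass@1 and Pass@K figures mentioned above are usually computed from n sampled solutions per problem, of which c pass the verifier. The sketch below uses the standard combinatorial estimator, 1 - C(n-c, k) / C(n, k), that is common in code- and math-generation evaluation. The abstract does not say which estimator the authors use, so this is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled solutions, c of which are correct.

    Assumption: this is the common unbiased combinatorial estimator,
    not necessarily the exact metric implementation used in the paper.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the probability that a random
    # size-k subset of the n samples contains no correct solution.
    # math.comb returns 0 when k > n - c, which gives pass@k = 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 10 samples and a single correct one, `pass_at_k(10, 1, 1)` is about 0.1, while `pass_at_k(10, 1, 10)` is 1.0. A policy that has collapsed onto one wrong branch gains nothing from a larger k, which is why Pass@K is used as a proxy for diversity.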

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
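The contrast between Shape Matching and Support Coverage can be illustrated on a toy next-token distribution. The sketch below is not the paper's objective: the abstract gives no formulas, so the top-p definition of the safe set, the probability floor, and every function name here are hypothetical choices made for illustration. It shows only that a KL penalty grows whenever the policy sharpens, whereas a support-based check stays satisfied as long as the reference model's high-confidence tokens keep non-negligible probability.

```python
import numpy as np


def top_p_support(ref_probs: np.ndarray, p: float = 0.85) -> np.ndarray:
    """Hypothetical 'safe set': the smallest set of tokens whose
    reference probability mass reaches p (a nucleus-style cutoff)."""
    order = np.argsort(ref_probs)[::-1]          # tokens by descending prob
    cum = np.cumsum(ref_probs[order])
    cutoff = int(np.searchsorted(cum, p)) + 1    # first index reaching p
    mask = np.zeros(ref_probs.shape, dtype=bool)
    mask[order[:cutoff]] = True
    return mask


def kl(policy: np.ndarray, ref: np.ndarray, eps: float = 1e-12) -> float:
    """KL(policy || ref): the global shape-matching penalty."""
    return float(np.sum(policy * (np.log(policy + eps) - np.log(ref + eps))))


def covered_fraction(policy: np.ndarray, mask: np.ndarray,
                     floor: float = 0.01) -> float:
    """Fraction of safe-set tokens that the policy can still sample,
    i.e. whose probability is at least `floor` (an assumed threshold)."""
    return float(np.mean(policy[mask] >= floor))


ref = np.array([0.4, 0.3, 0.2, 0.05, 0.05])
sharp = np.array([0.9, 0.05, 0.04, 0.005, 0.005])            # sharpened, support kept
collapsed = np.array([0.998, 0.0005, 0.0005, 0.0005, 0.0005])  # alternatives vanished

safe = top_p_support(ref)
for name, pi in [("sharp", sharp), ("collapsed", collapsed)]:
    print(name, "KL:", round(kl(pi, ref), 3),
          "covered:", round(covered_fraction(pi, safe), 3))
```

In this toy setting both sharpened policies incur a positive KL penalty, but only the collapsed one loses coverage of the safe set. A support-based constraint therefore separates useful sharpening from collapse, which a global KL term cannot do.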

3 Citations

0 Influential

3 Altmetric

18.0 Score
