2605.02320v1 May 04, 2026 cs.AI

ANO: A Principled Approach to Robust Policy Optimization

Kaiyan Zhao (Citations: 56, h-index: 5)

Zhenglin Wan (Citations: 28, h-index: 3)

Yiming Wang (Citations: 92, h-index: 6)

Jiayu Chen (Citations: 26, h-index: 2)

U. LeongHou (Citations: 30, h-index: 2)

Yiheng Zhang (Citations: 18, h-index: 2)

Original Abstract

Proximal Policy Optimization (PPO) dominates deep RL but faces a fundamental dilemma. Its "hard clipping" mechanism discards valuable gradient information from outliers, leading to sample inefficiency. Conversely, removing clipping (as in SPO) exposes optimization to unbounded gradients, causing significant instability and hyperparameter sensitivity. To resolve this, we establish a Unified Trust Region Framework that generalizes existing objectives. Within this framework, we derive Anchored Neighborhood Optimization (ANO) based on a set of design principles. We identify that the failure of standard policy gradients stems from a misapplication of gradient influence on outliers. We propose the Redescending Influence Principle, a paradigm shift from monotonic penalties (SPO) and hard-thresholding (PPO) to dynamic outlier suppression, and prove its necessity for stability in high-variance stochastic optimization. Theoretically, we prove ANO possesses the minimal structural complexity required for robust optimization. Empirically, ANO achieves state-of-the-art performance on MuJoCo benchmarks, significantly outperforming PPO and SPO. Notably, ANO demonstrates superior stability, preventing policy collapse even under aggressive hyperparameters (e.g., learning rates 3x larger than standard) where PPO fails completely.
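
The contrast the abstract draws between hard thresholding, monotonic penalties, and redescending influence can be illustrated with a small sketch. It compares how much gradient each scheme passes through as the probability ratio moves away from 1. Everything here is an assumption made for illustration: `ppo_clip_grad` follows the standard PPO clipped surrogate, `quadratic_penalty_grad` is a generic quadratic-penalty surrogate standing in for a monotonic penalty (not necessarily SPO's exact objective), and `tukey_psi` is Tukey's biweight, a classic redescending influence function from robust statistics. None of these is ANO's objective, which the abstract does not specify.

```python
def ppo_clip_grad(ratio, advantage, eps=0.2):
    """Gradient w.r.t. ratio of min(ratio*A, clip(ratio, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps)) * advantage
    # Gradient flows only when the unclipped term is the active minimum;
    # otherwise the clipped term is constant in ratio and the gradient is 0.
    return advantage if unclipped <= clipped else 0.0


def quadratic_penalty_grad(ratio, advantage, eps=0.2):
    """Gradient of ratio*A - |A|/(2*eps) * (ratio-1)**2 (illustrative stand-in).

    The penalty term grows linearly in |ratio - 1|, so the gradient
    magnitude is unbounded for far-off-policy samples.
    """
    return advantage - abs(advantage) * (ratio - 1.0) / eps


def tukey_psi(u, c=4.0):
    """Tukey's biweight influence function: rises, then redescends to 0."""
    if abs(u) > c:
        return 0.0
    return u * (1.0 - (u / c) ** 2) ** 2


if __name__ == "__main__":
    # Positive advantage, ratio drifting upward from 1.
    for r in (1.0, 1.1, 1.5, 3.0):
        print(
            f"ratio={r:.1f}  "
            f"ppo={ppo_clip_grad(r, 1.0):+.2f}  "
            f"quadratic={quadratic_penalty_grad(r, 1.0):+.2f}"
        )
    # Influence as a function of a generic deviation u.
    for u in (0.0, 1.0, 2.0, 3.5, 5.0):
        print(f"u={u:.1f}  tukey_psi={tukey_psi(u):.3f}")
```

Under these assumptions, the clipped surrogate drops to exactly zero gradient once the ratio leaves the clip range, the quadratic penalty keeps growing in magnitude, and the redescending function gives moderate deviations the most weight while smoothly suppressing extreme ones.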
