2604.13517v1 Apr 15, 2026 cs.LG

Representation over Routing: Overcoming Surrogate Hacking in Multi-Timescale PPO

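Temporal credit assignment has long been regarded as a key challenge in reinforcement learning. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent work has introduced multiple discount factors into actor-critic architectures, in particular Proximal Policy Optimization (PPO), to balance short-term responses against long-term planning. This paper shows, however, that indiscriminately combining multi-timescale signals in complex delayed-reward tasks can cause serious algorithmic problems. We show that exposing a temporal attention routing mechanism to policy gradients leads to surrogate objective hacking, while gradient-free uncertainty weighting leads to irreversible myopic degeneration, a phenomenon we name the Paradox of Temporal Uncertainty. To address these problems, we propose a Target Decoupling architecture: the critic keeps its multi-timescale predictions to strengthen auxiliary representation learning, while the actor strictly isolates short-term signals and updates the policy based only on long-term advantages. In rigorous experiments on LunarLander-v2 across multiple independent random seeds, the proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter tuning, it consistently exceeds the "Environment Solved" threshold with very low variance, completely avoids policy collapse, and escapes the local optima that trap single-timescale baselines.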

Original Abstract

Temporal credit assignment in reinforcement learning has long been a central challenge. Inspired by the multi-timescale encoding of the dopamine system in neurobiology, recent research has sought to introduce multiple discount factors into Actor-Critic architectures, such as Proximal Policy Optimization (PPO), to balance short-term responses with long-term planning. However, this paper reveals that blindly fusing multi-timescale signals in complex delayed-reward tasks can lead to severe algorithmic pathologies. We systematically demonstrate that exposing a temporal attention routing mechanism to policy gradients results in surrogate objective hacking, while adopting gradient-free uncertainty weighting triggers irreversible myopic degeneration, a phenomenon we term the Paradox of Temporal Uncertainty. To address these issues, we propose a Target Decoupling architecture: on the Critic side, we retain multi-timescale predictions to enforce auxiliary representation learning, while on the Actor side, we strictly isolate short-term signals and update the policy based solely on long-term advantages. Rigorous empirical evaluations across multiple independent random seeds in the LunarLander-v2 environment demonstrate that our proposed architecture achieves statistically significant performance improvements. Without relying on hyperparameter hacking, it consistently surpasses the "Environment Solved" threshold with minimal variance, completely eliminates policy collapse, and escapes the hovering local optima that trap single-timescale baselines.
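
The Target Decoupling idea in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. It assumes Generalized Advantage Estimation as the advantage estimator (the abstract does not name one), a single episode that ends at its last step, and hypothetical names throughout (`gae`, `decoupled_targets`, `clipped_surrogate`). The critic receives a regression target for every discount factor, while the actor only ever sees the advantage computed with the largest one.

```python
from typing import Dict, List, Tuple


def gae(rewards: List[float], values: List[float],
        gamma: float, lam: float = 0.95) -> List[float]:
    """Generalized Advantage Estimation for one episode.

    `values` must hold len(rewards) + 1 entries; the last entry is the
    bootstrap value (0.0 if the episode terminated).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def decoupled_targets(
    rewards: List[float],
    values_per_gamma: Dict[float, List[float]],
    lam: float = 0.95,
) -> Tuple[Dict[float, List[float]], List[float]]:
    """Return (critic targets for every timescale, actor advantage).

    Assumption: "long-term" means the largest discount factor in the dict.
    """
    advantages = {g: gae(rewards, v, g, lam)
                  for g, v in values_per_gamma.items()}
    # Every critic head is trained: target = advantage + current value.
    critic_targets = {
        g: [a + v for a, v in zip(advantages[g], values_per_gamma[g])]
        for g in advantages
    }
    # The actor is isolated from the short-horizon heads.
    long_gamma = max(values_per_gamma)
    return critic_targets, advantages[long_gamma]


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped objective for a single sample."""
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0]                # delayed reward at the last step
    values = {0.5: [0.0] * 4, 0.99: [0.0] * 4}  # illustrative critic outputs
    targets, actor_adv = decoupled_targets(rewards, values, lam=1.0)
    print(sorted(targets))   # both timescales feed the critic loss
    print(actor_adv)         # only the gamma = 0.99 advantage reaches the actor
```

In this sketch the short-horizon head still shapes the shared representation through the critic loss, but no weighting or routing over timescales is exposed to the policy gradient, which is the separation the abstract describes.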
