2601.21991v1 Jan 29, 2026 cs.LG

A Geometric Analysis of Drifting MDPs Based on Path-Integral Stability Certificates

Geometry of Drifting MDPs with Path-Integral Stability Certificates

Zuyuan Zhang

Citations: 130

h-index: 7

Tian Lan

Citations: 63

h-index: 6

Mahdi Imani

Citations: 107

h-index: 6

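Real-world reinforcement learning environments are often \textit{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt changes in the optimal action. Existing theory often represents nonstationarity with coarse, generic models that measure \textit{how much} the environment changes but do not capture \textit{how} it changes, even though acceleration and near-ties drive tracking error and policy instability. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature that captures cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe regions that guarantee local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate length, curvature, and near-tie proximity online and adjust learning or planning intensity accordingly. In experiments, HT-RL and HT-MCTS improve tracking performance and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone environments.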

Original Abstract

Real-world reinforcement learning is often \emph{nonstationary}: rewards and dynamics drift, accelerate, oscillate, and trigger abrupt switches in the optimal action. Existing theory often represents nonstationarity with coarse-scale models that measure \emph{how much} the environment changes, not \emph{how} it changes locally -- even though acceleration and near-ties drive tracking error and policy chattering. We take a geometric view of nonstationary discounted Markov Decision Processes (MDPs) by modeling the environment as a differentiable homotopy path and tracking the induced motion of the optimal Bellman fixed point. This yields a length-curvature-kink signature of intrinsic complexity: cumulative drift, acceleration/oscillation, and action-gap-induced nonsmoothness. We prove a solver-agnostic path-integral stability bound and derive gap-safe feasible regions that certify local stability away from switch regimes. Building on these results, we introduce \textit{Homotopy-Tracking RL (HT-RL)} and \textit{HT-MCTS}, lightweight wrappers that estimate replay-based proxies of length, curvature, and near-tie proximity online and adapt learning or planning intensity accordingly. Experiments show improved tracking and dynamic regret over matched static baselines, with the largest gains in oscillatory and switch-prone regimes.
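To make the length-curvature-kink idea concrete, the following is a minimal sketch, not the authors' method. It assumes a tabular discounted MDP with fixed transitions and a reward that moves along a discretized path, solves each MDP with plain value iteration, and uses finite differences of the optimal value function as stand-ins for length and curvature. The function names `optimal_q` and `path_signature`, the sup-norm, and the finite-difference proxies are all illustrative assumptions; the abstract does not give the paper's definitions, and HT-RL is described as using replay-based proxies rather than exact solves.

```python
import numpy as np


def optimal_q(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Value iteration for one MDP on the path.

    P: transitions, shape (A, S, S); R: rewards, shape (S, A).
    Returns an approximation of Q*, shape (S, A).
    """
    Q = np.zeros_like(R, dtype=float)
    for _ in range(max_iter):
        V = Q.max(axis=1)
        # Bellman optimality backup: R(s, a) + gamma * sum_t P(a, s, t) V(t)
        Q_new = R + gamma * np.einsum("ast,t->sa", P, V)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
    return Q


def path_signature(P, reward_path, gamma=0.9):
    """Illustrative proxies (assumed, not from the paper) along a reward path.

    length    : sum of sup-norm first differences of V*
    curvature : sum of sup-norm second differences of V*
    min_gap   : smallest gap between best and second-best action (needs A >= 2)
    """
    Qs = [optimal_q(P, R, gamma) for R in reward_path]
    Vs = np.array([Q.max(axis=1) for Q in Qs])  # shape (T, S)
    length = np.abs(np.diff(Vs, axis=0)).max(axis=1).sum()
    curvature = np.abs(np.diff(Vs, n=2, axis=0)).max(axis=1).sum()
    gaps = []
    for Q in Qs:
        ranked = np.sort(Q, axis=1)
        gaps.append((ranked[:, -1] - ranked[:, -2]).min())
    return float(length), float(curvature), float(min(gaps))


# Toy path: rewards interpolate linearly between two endpoints, so the
# optimal action in each state switches partway along the path.
P = np.full((2, 2, 2), 0.5)
R0 = np.array([[1.0, 0.0], [0.0, 2.0]])
R1 = np.array([[0.0, 1.0], [2.0, 0.0]])
reward_path = [(1 - t) * R0 + t * R1 for t in np.linspace(0.0, 1.0, 11)]
length, curvature, min_gap = path_signature(P, reward_path)
print(f"length={length:.3f} curvature={curvature:.3f} min_gap={min_gap:.3f}")
```

In this toy path the reward is smooth in the path parameter, yet the optimal value function bends where the optimal action switches and the action gap closes. That is the kind of nonsmoothness the abstract attributes to near-ties.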

6 Citations

0 Influential

3.5 Altmetric

23.5 Score


