2603.09344v1 Mar 10, 2026 cs.AI

Robust Regularized Policy Iteration under Transition Uncertainty

Yiding Sun (Citations: 51, h-index: 5)
Dongxu Zhang (Citations: 65, h-index: 6)
Hongqiang Lin (Citations: 19, h-index: 2)
Wei-Ting Tang (Citations: 37, h-index: 4)
Pengfei Wang (Citations: 23, h-index: 3)
Qixian Huang (Citations: 2, h-index: 1)
Zhe Fu (Citations: 6, h-index: 1)

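Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance tends to degrade under distribution shift. The learned policy may visit state-action pairs where the value estimates and the learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty within a single framework, we formulate offline RL as a robust policy optimization problem in which the transition kernel is treated as a decision variable within an uncertainty set and the policy is optimized against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel optimization problem with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a γ-contraction and by proving that iteratively updating the surrogate monotonically improves the original robust objective while converging. In experiments on the D4RL benchmarks, RRPI achieves strong average performance, outperforming recent baselines, including percentile-based methods such as PMDB, in most environments while remaining competitive in the rest. RRPI also behaves robustly: the learned Q-values decrease in regions of high uncertainty, which suggests that the resulting policy avoids unreliable state-action pairs under transition uncertainty.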

Original Abstract

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods such as PMDB on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust behavior. The learned $Q$-values decrease in regions with higher epistemic uncertainty, suggesting that the resulting policy avoids unreliable out-of-distribution actions under transition uncertainty.
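
The abstract does not spell out the robust regularized Bellman operator, so the following is only a minimal tabular sketch of the general idea, not the authors' RRPI operator. It assumes a textbook KL-penalized adversary: for each state-action pair the worst-case kernel solves min_P { E_P[V] + KL(P || P0) / beta }, which has the closed form -(1/beta) log E_{P0}[exp(-beta V)]. The names `kl_robust_backup`, `nominal_backup`, `solve`, the nominal kernel `P0`, and the penalty parameter `beta` are all hypothetical and chosen for illustration.

```python
import numpy as np


def kl_robust_backup(Q, R, P0, gamma=0.9, beta=5.0):
    """One illustrative KL-penalized robust Bellman backup (an assumption,
    not the paper's operator). Q, R have shape (S, A); P0 has shape (S, A, S)."""
    V = Q.max(axis=1)  # greedy state values, shape (S,)
    # Closed form of min_P { E_P[V] + KL(P || P0) / beta }:
    #   -(1 / beta) * log E_{P0}[exp(-beta * V)]
    # V is shifted by its minimum so every exponent is <= 0 (no overflow).
    v_min = V.min()
    soft_min = v_min - np.log(P0 @ np.exp(-beta * (V - v_min))) / beta
    return R + gamma * soft_min  # shape (S, A)


def nominal_backup(Q, R, P0, gamma=0.9):
    """Standard (non-robust) Bellman optimality backup under the nominal kernel."""
    return R + gamma * (P0 @ Q.max(axis=1))


def solve(backup, R, P0, iters=500, **kwargs):
    """Fixed-point iteration of a backup, starting from Q = 0."""
    Q = np.zeros_like(R)
    for _ in range(iters):
        Q = backup(Q, R, P0, **kwargs)
    return Q


# Small random MDP used purely as a toy example.
rng = np.random.default_rng(0)
S, A = 6, 3
R = rng.uniform(0.0, 1.0, size=(S, A))
P0 = rng.dirichlet(np.ones(S), size=(S, A))  # nominal kernel, shape (S, A, S)

Q_rob = solve(kl_robust_backup, R, P0, gamma=0.9, beta=5.0)
Q_nom = solve(nominal_backup, R, P0, gamma=0.9)

# Empirical check of the gamma-contraction property in the sup norm.
Q1 = rng.normal(size=(S, A))
Q2 = rng.normal(size=(S, A))
lhs = np.abs(kl_robust_backup(Q1, R, P0) - kl_robust_backup(Q2, R, P0)).max()
rhs = 0.9 * np.abs(Q1 - Q2).max()

print("largest pessimism gap (nominal - robust):", (Q_nom - Q_rob).max())
print("contraction check:", lhs, "<=", rhs)
```

In this sketch the soft-min is monotone and shift-invariant in V, which is why the backup is a gamma-contraction in the sup norm, and by Jensen's inequality the robust fixed point never exceeds the nominal one. As `beta` tends to 0 the backup approaches the nominal one, and larger `beta` lets the adversary move further from `P0`, lowering Q-values most where the next-state value varies most.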
