2605.07393v1 May 08, 2026 cs.AI

Offline Policy Optimization with Posterior Sampling

Yiding Sun (Citations: 64, h-index: 5)

Haijun Zhang (Citations: 74, h-index: 3)

Ning Yang (Citations: 28, h-index: 2)

Dongxu Zhang (Citations: 76, h-index: 6)

Hongqiang Lin (Citations: 24, h-index: 2)

Mingzhe Li (Citations: 14, h-index: 2)

Abstract

A fundamental challenge in model-based offline reinforcement learning (RL) lies in the trade-off between generalization and robustness against exploitation errors in out-of-distribution (OOD) regions. While OOD samples may capture valid underlying physical dynamics, they also introduce the risk of model exploitation. Existing methods typically address this risk through excessive pessimistic regularization, which ensures robustness but often sacrifices generalization. To overcome this limitation, we propose Posterior Sampling-based Policy Optimization (PSPO), which formulates dynamics modeling as a Bayesian inference process to derive a posterior that explicitly quantifies model fidelity. Through the integration of posterior sampling and constrained policy optimization, our method leverages dynamics-consistent OOD transitions for generalization while ensuring robustness against model exploitation. Theoretically, we formulate Q-value estimation under posterior sampling as a stochastic approximation problem and establish its convergence. We decompose policy optimization into a sequence of constrained subproblems, demonstrating that solving these subproblems guarantees monotonic improvement until convergence. Experiments on standard benchmarks validate that PSPO achieves superior performance compared to state-of-the-art baselines.
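The abstract does not spell out PSPO's algorithm, so the following is only a toy sketch of the two ideas it names: a Bayesian posterior over dynamics models, and Q-value estimation driven by models sampled from that posterior. Everything concrete here is an assumption made for illustration and not the authors' method: the finite set of candidate models, the tabular Q-function, the uniform state-action sampling, the `1/n` step size (chosen because it satisfies the usual Robbins-Monro conditions for stochastic approximation), and all function names. The constrained policy optimization step is omitted.

```python
import math
import random


def posterior_over_models(models, dataset, prior=None):
    """Bayes' rule over a finite set of candidate dynamics models (an assumption).

    models[k][s][a] is a list of next-state probabilities; dataset holds
    (s, a, s_next) transitions. Returns normalized posterior weights.
    """
    n = len(models)
    prior = prior or [1.0 / n] * n
    log_post = []
    for k, model in enumerate(models):
        lp = math.log(prior[k])
        for s, a, s_next in dataset:
            # Floor the likelihood so an impossible transition does not give log(0).
            lp += math.log(max(model[s][a][s_next], 1e-12))
        log_post.append(lp)
    m = max(log_post)  # subtract the max before exponentiating, for stability
    weights = [math.exp(lp - m) for lp in log_post]
    total = sum(weights)
    return [w / total for w in weights]


def q_estimation_with_posterior_sampling(models, posterior, reward,
                                         gamma=0.9, steps=20000, seed=0):
    """Tabular Q-learning where each update uses a model drawn from the posterior."""
    rng = random.Random(seed)
    n_states, n_actions = len(reward), len(reward[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    for _ in range(steps):
        # Posterior sampling: pick one candidate model for this update.
        k = rng.choices(range(len(models)), weights=posterior)[0]
        s, a = rng.randrange(n_states), rng.randrange(n_actions)
        s_next = rng.choices(range(n_states), weights=models[k][s][a])[0]
        visits[s][a] += 1
        alpha = 1.0 / visits[s][a]  # step sizes sum to infinity, squares are summable
        target = reward[s][a] + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])
    return q


# Hypothetical two-state, two-action problem with two candidate models.
model_a = [[[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]]]
model_b = [[[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]]]
reward = [[0.0, 0.0], [1.0, 1.0]]  # reward[s][a]: acting from state 1 pays 1

# Offline dataset generated from model_a, standing in for logged transitions.
data_rng = random.Random(1)
dataset = []
for _ in range(200):
    s, a = data_rng.randrange(2), data_rng.randrange(2)
    s_next = data_rng.choices([0, 1], weights=model_a[s][a])[0]
    dataset.append((s, a, s_next))

posterior = posterior_over_models([model_a, model_b], dataset)
q = q_estimation_with_posterior_sampling([model_a, model_b], posterior, reward)
```

In this sketch the posterior weight plays the role of the "model fidelity" measure mentioned in the abstract: a candidate model that explains the logged transitions poorly is rarely sampled, so it contributes little to the Q-value targets. A practical model-based method would presumably use learned neural dynamics models and function approximation in place of these tables.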

1 Citation

0 Influential

3 Altmetric

16.0 Score
