2602.02900v1 Feb 02, 2026 cs.LG

Manifold-Constrained Energy-Based Transition Models for Offline Reinforcement Learning

Zuyuan Zhang

Citations: 85

h-index: 7

Tian Lan

Citations: 32

h-index: 3

Zeyu Fang

Citations: 34

h-index: 3

Mahdi Imani

Citations: 77

h-index: 6

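Model-based offline reinforcement learning is vulnerable to distribution shift. Policy improvement pushes rollouts into state-action regions that the dataset supports only weakly, where model error compounds and causes severe value overestimation. This work proposes Manifold-Constrained Energy-based Transition Models (MC-ETM). MC-ETM trains a conditional energy-based transition model using a manifold projection-diffusion negative sampler. It learns a latent manifold of next states and generates negative samples near that manifold by perturbing latent codes and then running Langevin dynamics in latent space under the learned conditional energy. This sharpens the energy landscape around the dataset support and improves sensitivity to subtle distribution shift. During policy optimization, the learned energy serves as a single reliability signal: a rollout is truncated when the minimum energy over the sampled next states exceeds a threshold, and Bellman updates are stabilized with a pessimistic penalty based on the dispersion of Q-values across energy-guided samples. The authors formalize MC-ETM as a hybrid pessimistic MDP and derive a conservative performance bound that separates in-support evaluation error from truncation risk. In experiments, MC-ETM improves multi-step dynamics accuracy and achieves higher normalized returns on standard offline control benchmarks, particularly in environments with irregular dynamics and sparse data coverage.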
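
The negative-sampling step described above (perturb a latent code, then run Langevin dynamics in latent space under the conditional energy) can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the one-dimensional latent space, the toy `decode` and `energy` functions, the finite-difference gradient, and the values of `noise_scale`, `step`, and `n_steps`.

```python
import math
import random

def decode(z):
    """Hypothetical decoder: maps a 1-D latent code to a 2-D next state on a curve."""
    return (z, z * z)

def energy(s, a, s_next):
    """Hypothetical conditional energy: low when s_next is close to a toy prediction."""
    target = (s + a, (s + a) ** 2)
    return sum((x - t) ** 2 for x, t in zip(s_next, target))

def latent_energy_grad(s, a, z, eps=1e-4):
    """Central finite-difference gradient of E(s, a, decode(z)) with respect to z."""
    hi = energy(s, a, decode(z + eps))
    lo = energy(s, a, decode(z - eps))
    return (hi - lo) / (2 * eps)

def sample_negative(s, a, z_pos, rng, noise_scale=0.5, step=0.01, n_steps=20):
    """Perturb a latent code, then run Langevin dynamics in latent space."""
    # Perturbation: move away from the latent code of the observed next state.
    z = z_pos + noise_scale * rng.gauss(0.0, 1.0)
    for _ in range(n_steps):
        grad = latent_energy_grad(s, a, z)
        # Langevin update: gradient step on the energy plus Gaussian noise.
        z = z - step * grad + math.sqrt(2 * step) * rng.gauss(0.0, 1.0)
    # Decoding keeps the sample on the decoder's manifold by construction.
    return decode(z)

rng = random.Random(0)
negative = sample_negative(s=0.5, a=0.5, z_pos=1.0, rng=rng)
```

Because the dynamics run on the latent code and the result is decoded afterwards, the negative stays on the decoder's image of the latent space. That is the sense in which such a sample is "near-manifold" in this sketch. How the negatives enter the training loss is not specified in the abstract and is left out here.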

Original Abstract

Model-based offline reinforcement learning is brittle under distribution shift: policy improvement drives rollouts into state-action regions weakly supported by the dataset, where compounding model error yields severe value overestimation. We propose Manifold-Constrained Energy-based Transition Models (MC-ETM), which train conditional energy-based transition models using a manifold projection-diffusion negative sampler. MC-ETM learns a latent manifold of next states and generates near-manifold hard negatives by perturbing latent codes and running Langevin dynamics in latent space with the learned conditional energy, sharpening the energy landscape around the dataset support and improving sensitivity to subtle out-of-distribution deviations. For policy optimization, the learned energy provides a single reliability signal: rollouts are truncated when the minimum energy over sampled next states exceeds a threshold, and Bellman backups are stabilized via pessimistic penalties based on Q-value-level dispersion across energy-guided samples. We formalize MC-ETM through a hybrid pessimistic MDP formulation and derive a conservative performance bound separating in-support evaluation error from truncation risk. Empirically, MC-ETM improves multi-step dynamics fidelity and yields higher normalized returns on standard offline control benchmarks, particularly under irregular dynamics and sparse data coverage.
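
The two uses of the energy during policy optimization can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract fixes only the truncation rule (minimum energy over sampled next states exceeds a threshold) and says the penalty depends on Q-value dispersion. The function names, the use of the mean as the base value, the population standard deviation as the dispersion measure, and the `gamma` and `beta` parameters are all hypothetical choices.

```python
import statistics

def should_truncate(energies, threshold):
    """Stop the rollout when even the lowest-energy sampled next state is above threshold."""
    return min(energies) > threshold

def pessimistic_target(reward, q_values, gamma=0.99, beta=1.0):
    """Bellman target penalized by the spread of Q-values across sampled next states.

    Assumption: dispersion is measured by the population standard deviation,
    and the unpenalized value is the mean over samples.
    """
    mean_q = statistics.fmean(q_values)
    spread = statistics.pstdev(q_values)
    return reward + gamma * (mean_q - beta * spread)

# Energies of three sampled next states for one (state, action) pair.
keep_going = not should_truncate([0.2, 1.5, 3.0], threshold=1.0)

# Q-values that disagree across samples lower the backup target.
target = pessimistic_target(reward=0.0, q_values=[1.0, 3.0], gamma=1.0, beta=1.0)
```

Taking the minimum means one plausible next state is enough to continue the rollout. The rollout stops only when every sampled candidate looks out of support.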

3 Citations

0 Influential

3.5 Altmetric

20.5 Score

