2603.04289v1 Mar 04, 2026 cs.LG

IPD: Boosting Sequential Policy with Imaginary Planning Distillation in Offline Reinforcement Learning

Peiran Liu (Citations: 14, h-index: 2)

Yiding Ji (Citations: 18, h-index: 2)

Yihao Qin (Citations: 9, h-index: 2)

Hang Zhou (Citations: 11, h-index: 2)

Hao Dong (Citations: 157, h-index: 3)

Yuanfei Wang (Citations: 27, h-index: 3)

Abstract

Decision-Transformer-based sequential policies have emerged as a powerful paradigm in offline reinforcement learning (RL), yet their efficacy remains constrained by the quality of static datasets and by inherent architectural limitations. Specifically, these models often struggle to integrate suboptimal experiences effectively and do not explicitly plan for an optimal policy. To bridge this gap, we propose **Imaginary Planning Distillation (IPD)**, a novel framework that incorporates offline planning into data generation, supervised training, and online inference. Our framework first learns, from the offline data, a world model equipped with uncertainty measures and a quasi-optimal value function. These components are used to identify suboptimal trajectories and augment them with reliable, imagined optimal rollouts generated via Model Predictive Control (MPC). A Transformer-based sequential policy is then trained on this enriched dataset, complemented by a value-guided objective that promotes the distillation of the optimal policy. By replacing the conventional, manually tuned return-to-go with the learned quasi-optimal value function, IPD improves both decision-making stability and performance during inference. Empirical evaluations on the D4RL benchmark demonstrate that IPD significantly outperforms several state-of-the-art value-based and transformer-based offline RL methods across diverse tasks.
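The data-augmentation step described in the abstract (flag suboptimal trajectories, then add imagined rollouts planned with MPC under an uncertainty check) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the 1-D toy dynamics ensemble, the `reward` and `value` functions, the random-shooting planner, the disagreement threshold `max_spread`, and the return threshold used to flag a trajectory as suboptimal.

```python
import random

# Hypothetical stand-ins for learned components (NOT from the paper):
# a small "ensemble" of 1-D dynamics models, a reward, and a value function.
ENSEMBLE = [lambda s, a, k=k: s + a * (1.0 + 0.05 * k) for k in (-1, 0, 1)]

def reward(s, a):
    # Toy reward: stay close to the origin, with a small action cost.
    return -(s * s) - 0.01 * a * a

def value(s):
    # Toy terminal value estimate standing in for a learned value function.
    return -(s * s)

def predict(s, a):
    """Return (mean next state, ensemble disagreement) for one step."""
    preds = [f(s, a) for f in ENSEMBLE]
    mean = sum(preds) / len(preds)
    spread = max(preds) - min(preds)
    return mean, spread

def mpc_rollout(s0, horizon=5, candidates=256, penalty=1.0,
                max_spread=0.5, rng=None):
    """Random-shooting MPC: return the best imagined (states, actions, score),
    or None if every candidate leaves the region where the ensemble agrees."""
    rng = rng or random.Random(0)
    best = None
    for _ in range(candidates):
        s, states, actions, score, ok = s0, [s0], [], 0.0, True
        for _ in range(horizon):
            a = rng.uniform(-1.0, 1.0)
            nxt, spread = predict(s, a)
            if spread > max_spread:      # discard unreliable imagination
                ok = False
                break
            score += reward(s, a) - penalty * spread
            s = nxt
            states.append(s)
            actions.append(a)
        if not ok:
            continue
        score += value(s)                # bootstrap with the terminal value
        if best is None or score > best[2]:
            best = (states, actions, score)
    return best

def augment(dataset, threshold):
    """Append one imagined rollout per trajectory whose return is below threshold."""
    out = list(dataset)
    for traj in dataset:
        if traj["return"] < threshold:
            plan = mpc_rollout(traj["states"][0])
            if plan is not None:
                out.append({"states": plan[0], "actions": plan[1],
                            "score": plan[2], "imagined": True})
    return out
```

In this sketch the planner scores each candidate by accumulated reward minus an uncertainty penalty plus a terminal value, which is one common way to combine a world model with a value function. The paper's actual planner, uncertainty measure, and selection rule may differ.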

