2604.01765v1 Apr 02, 2026 cs.CV

DriveDreamer-Policy: A Geometry-Grounded World-Action Model for Unified Generation and Planning

Zhengqiu Zhu (Citations: 16, h-index: 3)
Hao Shao (Citations: 46, h-index: 3)
Letian Wang (Citations: 1,509, h-index: 10)
Steven L. Waslander (Citations: 23, h-index: 3)
Guosheng Zhao (Citations: 735, h-index: 14)
Jiangnan Shao (Citations: 64, h-index: 3)
Jiagang Zhu (Citations: 1,100, h-index: 14)
Ting Yu (Citations: 6, h-index: 2)
Guan Huang (Citations: 1,176, h-index: 17)
Yang Zhou (Citations: 145, h-index: 5)
Xiaofeng Wang (Citations: 1,791, h-index: 17)

Original Abstract

World-action models (WAMs) have recently emerged to bridge vision-language-action (VLA) models and world models, combining the reasoning and instruction-following capabilities of the former with the spatio-temporal world modeling of the latter. However, existing WAM approaches often focus on modeling 2D appearance or latent representations and offer limited geometric grounding, an essential element for embodied systems operating in the physical world. We present DriveDreamer-Policy, a unified driving world-action model that integrates depth generation, future video generation, and motion planning within a single modular architecture. The model employs a large language model to process language instructions, multi-view images, and actions, followed by three lightweight generators that produce depth, future video, and actions. By learning a geometry-aware world representation and using it to guide both future prediction and planning within a unified framework, the proposed model produces more coherent imagined futures and more informed driving actions, while maintaining modularity and controllable latency. Experiments on the Navsim v1 and v2 benchmarks demonstrate that DriveDreamer-Policy achieves strong performance on both closed-loop planning and world generation tasks. In particular, the model reaches 89.2 PDMS on Navsim v1 and 88.7 EPDMS on Navsim v2, outperforming existing world-model-based approaches while producing higher-quality future video and depth predictions. Ablation studies further show that explicit depth learning provides complementary benefits to video imagination and improves planning robustness.
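
The data flow described in the abstract (one shared backbone feeding three lightweight generators) can be illustrated with a toy sketch. This is not the authors' implementation: every name below (`Observation`, `shared_backbone`, `depth_generator`, `video_generator`, `action_generator`, `world_action_step`, `imagine_video`) is hypothetical, the arithmetic is placeholder math standing in for learned networks, and the choice to condition the video and action heads on the depth output is only one possible reading of "geometry-aware" guidance. The optional video branch is likewise an assumption about how modularity might translate into controllable latency.

```python
from dataclasses import dataclass
from typing import List, Optional

Vector = List[float]


@dataclass
class Observation:
    instruction: str                 # language instruction
    multi_view_images: List[Vector]  # one placeholder feature vector per camera
    past_actions: List[Vector]       # placeholder action history


@dataclass
class Outputs:
    depth: Vector
    future_video: Optional[List[Vector]]  # None when the video branch is skipped
    actions: List[Vector]


def shared_backbone(obs: Observation) -> Vector:
    """Stand-in for the large language model: fuses all inputs into one context."""
    n_views = len(obs.multi_view_images)
    dim = len(obs.multi_view_images[0])
    # Average the camera features dimension-wise.
    ctx = [sum(v[i] for v in obs.multi_view_images) / n_views for i in range(dim)]
    # Append crude scalar summaries of the instruction and the action history.
    ctx.append(float(len(obs.instruction.split())))
    ctx.append(float(len(obs.past_actions)))
    return ctx


def depth_generator(ctx: Vector) -> Vector:
    """Toy depth head: one non-negative value per context entry."""
    return [abs(x) for x in ctx]


def video_generator(ctx: Vector, depth: Vector, horizon: int) -> List[Vector]:
    """Toy video head: `horizon` future frames built from context and depth."""
    return [[c + d * (t + 1) for c, d in zip(ctx, depth)] for t in range(horizon)]


def action_generator(ctx: Vector, depth: Vector, horizon: int) -> List[Vector]:
    """Toy planner head: `horizon` (x, y) waypoints scaled by the mean depth."""
    step = sum(depth) / len(depth)
    return [[step * (t + 1), 0.0] for t in range(horizon)]


def world_action_step(
    obs: Observation, horizon: int = 4, imagine_video: bool = True
) -> Outputs:
    ctx = shared_backbone(obs)    # shared representation, computed once
    depth = depth_generator(ctx)  # geometry is produced first (assumed ordering)
    video = video_generator(ctx, depth, horizon) if imagine_video else None
    actions = action_generator(ctx, depth, horizon)
    return Outputs(depth=depth, future_video=video, actions=actions)


obs = Observation(
    instruction="turn left at the intersection",
    multi_view_images=[[0.2, -0.4], [0.6, 0.0]],
    past_actions=[[1.0, 0.0]],
)
full = world_action_step(obs)                       # depth + video + actions
fast = world_action_step(obs, imagine_video=False)  # skip video generation
```

In this sketch the planner does not depend on the video branch, so skipping video leaves the planned waypoints unchanged while doing less work. Whether the actual model couples planning to imagined video is not stated in the abstract.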

3 Citations

0 Influential

8.5 Altmetric

45.5 Score
