2603.19979v1 Mar 20, 2026 cs.CV

X-World: Controllable Ego-Centric Multi-Camera World Models for Scalable End-to-End Driving

Xianming Liu

Citations: 32

h-index: 3

Chaoda Zheng

Citations: 1,898

h-index: 13

Sean Li

Citations: 23

h-index: 2

Jin-Sheng Deng

Citations: 33

h-index: 2

Zhennan Wang

Citations: 14

h-index: 3

Shijia Chen

Citations: 9

h-index: 2

Liqiang Xiao

Citations: 40

h-index: 3

Ziheng Chi

Citations: 13

h-index: 3

Hongbin Lin

Citations: 39

h-index: 3

Kangjie Chen

Citations: 50

h-index: 3

Boyang Wang

Citations: 171

h-index: 3

Yu Zhang

Citations: 10

h-index: 2

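As end-to-end approaches, in which vision-language-action (VLA) policies map raw sensor data directly to driving actions, take hold in autonomous driving, scalable and reliable evaluation is becoming increasingly critical. Current evaluation pipelines, however, still rely heavily on real-world road testing, which is costly, covers only a limited range of scenarios, and is difficult to reproduce. These limitations call for a real-world simulator that generates realistic future observations under proposed actions while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that generates future observations directly in video space. Given synchronized multi-camera video history and a future action sequence, X-World generates future multi-camera video streams that follow the specified actions. To support reproducible and editable scenario generation, X-World also offers optional control over dynamic traffic agents and static road elements, and allows appearance-level control, such as weather and time of day, through text prompts. Beyond world simulation, X-World can perform video style transfer conditioned on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.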

Original Abstract

Scalable and reliable evaluation is increasingly critical in the end-to-end era of autonomous driving, where vision--language--action (VLA) policies directly map raw sensor streams to driving actions. Yet, current evaluation pipelines still rely heavily on real-world road testing, which is costly, biased toward limited scenario coverage, and difficult to reproduce. These challenges motivate a real-world simulator that can generate realistic future observations under proposed actions, while remaining controllable and stable over long horizons. We present X-World, an action-conditioned multi-camera generative world model that simulates future observations directly in video space. Given synchronized multi-view camera history and a future action sequence, X-World generates future multi-camera video streams that follow the commanded actions. To ensure reproducible and editable scene rollouts, X-World further supports optional controls over dynamic traffic agents and static road elements, and retains a text-prompt interface for appearance-level control (e.g., weather and time of day). Beyond world simulation, X-World also enables video style transfer by conditioning on appearance prompts while preserving the underlying action and scene dynamics. At the core of X-World is a multi-view latent video generator designed to explicitly encourage cross-view geometric consistency and temporal coherence under diverse control signals. Experiments show that X-World achieves high-quality multi-view video generation with (i) strong view consistency across cameras, (ii) stable temporal dynamics over long rollouts, and (iii) high controllability with strict action following and faithful adherence to optional scene controls. These properties make X-World a practical foundation for scalable and reproducible evaluation.
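The input/output contract described in the abstract (multi-view camera history plus a future action sequence in, future multi-camera frames out) can be illustrated with a minimal closed-loop evaluation sketch. Everything named here is an assumption made for illustration: `Action`, `StubWorldModel`, `predict`, `closed_loop_rollout`, the camera names, and the toy dynamics are hypothetical and do not reflect X-World's actual API, action parameterization, or architecture. The stub only mimics the shape of the interface so that the evaluation loop itself is runnable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Frame = List[float]                # hypothetical placeholder for one image
MultiViewFrame = Dict[str, Frame]  # camera name -> frame at one timestep


@dataclass
class Action:
    """Hypothetical ego action; the real action format is not given in the abstract."""
    steer: float
    accel: float


class StubWorldModel:
    """Stand-in with the input/output shape the abstract describes, not the real model."""

    def predict(self, history: List[MultiViewFrame], actions: List[Action],
                prompt: Optional[str] = None) -> List[MultiViewFrame]:
        last = history[-1]
        future = []
        for action in actions:
            # Toy dynamics: offset every value by the action so the output depends on it.
            last = {cam: [p + action.accel for p in frame] for cam, frame in last.items()}
            future.append(last)
        return future


def closed_loop_rollout(model: StubWorldModel,
                        policy: Callable[[MultiViewFrame], Action],
                        history: List[MultiViewFrame],
                        steps: int,
                        context: int = 4,
                        prompt: Optional[str] = None
                        ) -> Tuple[List[MultiViewFrame], List[Action]]:
    """Alternate policy and world model: act on the latest frames, then generate the next ones."""
    history = list(history)  # copy so the caller's list is not mutated
    actions: List[Action] = []
    for _ in range(steps):
        action = policy(history[-1])
        next_frames = model.predict(history[-context:], [action], prompt)
        history.extend(next_frames)
        actions.append(action)
    return history, actions


def constant_policy(observation: MultiViewFrame) -> Action:
    """Hypothetical driving policy that always accelerates gently in a straight line."""
    return Action(steer=0.0, accel=1.0)


cameras = ["front", "left", "right"]
initial = [{cam: [0.0, 0.0] for cam in cameras}]
frames, actions = closed_loop_rollout(StubWorldModel(), constant_policy, initial, steps=3)
```

In this sketch the policy under test never touches a real vehicle: each action it emits is fed back into the generator, which produces the next observation for every camera. Fixing the initial history, the prompt, and any random seed is what would make such a rollout reproducible.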

4 Citations

1 Influential

6.5 Altmetric

38.5 Score
