2602.00096v1 Jan 24, 2026 cs.CV

Mirage2Matter: A Physically Grounded Gaussian World Model from Video

Yandong Guo (Citations: 48, h-index: 4)
Xin Wang (Citations: 2, h-index: 1)
Zhengqing Gao (Citations: 22, h-index: 2)
Jiaxin Huang (Citations: 29, h-index: 3)
Zhenyang Ren (Citations: 1, h-index: 1)
Ming Shao (Citations: 7, h-index: 1)
Hanlue Zhang (Citations: 5, h-index: 2)
Tianyu Huang (Citations: 10, h-index: 2)
Yongkang Cheng (Citations: 102, h-index: 5)
Runqi Lin, University of Oxford (Citations: 91, h-index: 6)
Yuanyuan Wang (Citations: 1, h-index: 1)
Tongliang Liu (Citations: 118, h-index: 7)
Kun Zhang (Citations: 0, h-index: 0)
Mingming Gong (Citations: 624, h-index: 9)
Ziwen Li (Citations: 163, h-index: 6)

Original Abstract

The scalability of embodied intelligence is fundamentally constrained by the scarcity of real-world interaction data. While simulation platforms provide a promising alternative, existing approaches often suffer from a substantial visual and physical gap to real environments and rely on expensive sensors, precise robot calibration, or depth measurements, limiting their practicality at scale. We present Simulate Anything, a graphics-driven world modeling and simulation framework that enables efficient generation of high-fidelity embodied training data using only multi-view environment videos and off-the-shelf assets. Our approach reconstructs real-world environments into a photorealistic scene representation using 3D Gaussian Splatting (3DGS), seamlessly capturing fine-grained geometry and appearance from video. We then leverage generative models to recover a physically realistic representation and integrate it into a simulation environment via a precision calibration target, enabling accurate scale alignment between the reconstructed scene and the real world. Together, these components provide a unified, editable, and physically grounded world model. Vision Language Action (VLA) models trained on our simulated data achieve strong zero-shot performance on downstream tasks, matching or even surpassing results obtained with real-world data, highlighting the potential of reconstruction-driven world modeling for scalable and practical embodied intelligence training.
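
The abstract states that a calibration target enables scale alignment between the reconstructed scene and the real world, but it does not give the procedure. The following is a minimal sketch of one plausible way to do this, not the authors' method: it assumes two points on the target have been located in the reconstruction, that their real-world separation is known, and that Gaussian extents are stored as linear (not log) per-axis scales. The function names `estimate_scale` and `rescale_gaussians`, and all numeric values, are hypothetical.

```python
import math


def estimate_scale(p_a, p_b, true_distance_m):
    """Ratio of a known real-world distance to the same distance in the reconstruction."""
    recon_distance = math.dist(p_a, p_b)
    if recon_distance == 0:
        raise ValueError("reference points coincide in the reconstruction")
    return true_distance_m / recon_distance


def rescale_gaussians(means, scales, s):
    """Uniformly scale Gaussian centers and per-axis extents by s (about the origin).

    Rotations and opacities are unaffected by a uniform scaling, so they are
    not touched here.
    """
    new_means = [tuple(s * c for c in m) for m in means]
    new_scales = [tuple(s * c for c in sc) for sc in scales]
    return new_means, new_scales


# Hypothetical values: two corners of the target are 0.20 m apart in reality
# but 0.5 units apart in the (arbitrarily scaled) reconstruction.
corner_a = (0.0, 0.0, 0.0)
corner_b = (0.5, 0.0, 0.0)
s = estimate_scale(corner_a, corner_b, 0.20)

means, scales = rescale_gaussians(
    means=[(1.0, 2.0, 3.0)],
    scales=[(0.1, 0.1, 0.05)],
    s=s,
)
```

A single scalar is enough here because video-based reconstruction is typically ambiguous only up to a global scale; aligning orientation and origin with the simulator's frame would need an additional rigid transform, which this sketch leaves out.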
