2601.17067v1 Jan 22, 2026 cs.CV

A Mechanistic View on Video Generation as World Models: State and Dynamics

Man Chen (Citations: 10, h-index: 2)
Luozhou Wang (Citations: 207, h-index: 9)
Zhifei Chen (Citations: 38, h-index: 3)
Yihua Du (Citations: 11, h-index: 2)
Dongyu Yan (Citations: 63, h-index: 4)
Wenhang Ge (Citations: 100, h-index: 7)
Guibao Shen (Citations: 35, h-index: 3)
Xinli Xu (Citations: 73, h-index: 6)
Leyi Wu (Citations: 49, h-index: 3)
Tianshuo Xu (Citations: 86, h-index: 4)
Peiran Ren (Citations: 174, h-index: 6)
Xin Tao (Citations: 6, h-index: 2)
Pengfei Wan (Citations: 332, h-index: 10)
Ying-Cong Chen (Citations: 31, h-index: 3)

Original Abstract

Large-scale video generation models have demonstrated emergent physical coherence, positioning them as potential world models. However, a gap remains between contemporary "stateless" video architectures and classic state-centric world model theories. This work bridges this gap by proposing a novel taxonomy centered on two pillars: State Construction and Dynamics Modeling. We categorize state construction into implicit paradigms (context management) and explicit paradigms (latent compression), while dynamics modeling is analyzed through knowledge integration and architectural reformulation. Furthermore, we advocate for a transition in evaluation from visual fidelity to functional benchmarks, testing physical persistence and causal reasoning. We conclude by identifying two critical frontiers: enhancing persistence via data-driven memory and compressed fidelity, and advancing causality through latent factor decoupling and reasoning-prior integration. By addressing these challenges, the field can evolve from generating visually plausible videos to building robust, general-purpose world simulators.
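The distinction between implicit state (context management) and explicit state (latent compression) can be illustrated with a toy sketch. Nothing below comes from the paper: the class names, the list-of-floats "frame", and the exponential-moving-average update are all hypothetical stand-ins chosen only to show the contrast in how each paradigm carries history forward.

```python
from collections import deque
from typing import Deque, List, Tuple

Frame = List[float]  # toy stand-in for a video frame (hypothetical)


class ImplicitContextState:
    """Implicit state: keep the most recent raw frames as context."""

    def __init__(self, window: int) -> None:
        self.frames: Deque[Frame] = deque(maxlen=window)

    def update(self, frame: Frame) -> None:
        # Once the window is full, the oldest frame is silently dropped.
        self.frames.append(frame)

    def size(self) -> int:
        # Number of stored values; bounded by window * frame length.
        return sum(len(f) for f in self.frames)


class ExplicitLatentState:
    """Explicit state: fold every frame into one fixed-size latent vector."""

    def __init__(self, dim: int, decay: float = 0.9) -> None:
        self.latent: List[float] = [0.0] * dim
        self.decay = decay

    def update(self, frame: Frame) -> None:
        # Exponential moving average as a toy compression rule (an assumption,
        # not a method described in the abstract).
        self.latent = [
            self.decay * z + (1.0 - self.decay) * x
            for z, x in zip(self.latent, frame)
        ]

    def size(self) -> int:
        # Constant regardless of how many frames have been observed.
        return len(self.latent)


def rollout(
    num_frames: int, dim: int = 4, window: int = 3
) -> Tuple[ImplicitContextState, ExplicitLatentState]:
    """Feed the same synthetic frames to both state representations."""
    implicit = ImplicitContextState(window)
    explicit = ExplicitLatentState(dim)
    for t in range(num_frames):
        frame = [float(t)] * dim
        implicit.update(frame)
        explicit.update(frame)
    return implicit, explicit
```

In this sketch the implicit state retains recent frames exactly but forgets everything outside the window, while the explicit state has a fixed footprint and retains a lossy trace of all past frames. That trade-off is roughly the persistence-versus-fidelity tension the abstract points to.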

2 Citations · 0 Influential · 5 Altmetric · 27.0 Score
