2603.16870v1 Mar 17, 2026 cs.CV

Demystifying Video Reasoning

Ruisi Wang

Citations: 214

h-index: 7

Ran Ji

Citations: 15

h-index: 2

Dahua Lin

Citations: 117

h-index: 8

Maijunxian Wang

Citations: 20

h-index: 2

Zhongang Cai

MMLab@NTU, Nanyang Technological University

Citations: 4,497

h-index: 31

Hokin Deng

Citations: 127

h-index: 6

Fanyi Pu

Citations: 958

h-index: 6

Wanqi Yin

Citations: 596

h-index: 11

Chenyang Gu

Citations: 143

h-index: 4

Ziqi Huang

Citations: 2,523

h-index: 13

Ju Xu

Citations: 12

h-index: 2

Bo Li

Citations: 5

h-index: 2

Ziwei Liu

Citations: 84

h-index: 5

Lei Yang

Citations: 380

h-index: 7

Abstract

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.
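The training-free strategy mentioned in the abstract, ensembling latent trajectories from the same model under different random seeds, can be illustrated with a toy sketch. The abstract does not say how or when the trajectories are combined, so everything below is an assumption made for illustration: the latent is a plain list of floats, `toy_denoise_step` is a hypothetical stand-in for a real denoiser, and the trajectories are merged by simple averaging at a single early step (`merge_step`). It is not the authors' implementation.

```python
import random


def toy_denoise_step(latent, target, step, total_steps):
    """Hypothetical stand-in for one denoising step.

    Moves the latent a fraction of the way toward `target`; the fraction
    grows with the step index so the last step lands on the target.
    """
    rate = 1.0 / (total_steps - step)
    return [x + rate * (t - x) for x, t in zip(latent, target)]


def init_latent(seed, dim):
    """Draw an initial Gaussian-noise latent from a given random seed."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def mean_latent(latents):
    """Element-wise average of several latents of equal length."""
    count = len(latents)
    return [sum(values) / count for values in zip(*latents)]


def ensemble_sample(seeds, target, total_steps=10, merge_step=2):
    """Run one trajectory per seed, average them once, then finish a single run.

    Assumption: merging happens once, right after step `merge_step`, by
    averaging the per-seed latents. Other merge rules are equally plausible.
    """
    latents = [init_latent(seed, len(target)) for seed in seeds]
    for step in range(total_steps):
        latents = [
            toy_denoise_step(z, target, step, total_steps) for z in latents
        ]
        if step == merge_step:
            # Collapse the per-seed trajectories into one shared latent.
            latents = [mean_latent(latents)]
    return latents[0]


if __name__ == "__main__":
    sample = ensemble_sample(seeds=[0, 1, 2], target=[1.0, -2.0, 0.5])
    print(sample)
```

Averaging early fits the abstract's description of early denoising steps as the phase where several candidate solutions are still being explored. A real system would need to decide which steps to merge at and whether a plain average of latents is meaningful for the model in question.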

3 Citations

0 Influential

15.5 Altmetric

80.5 Score
