2602.05986v1 Feb 05, 2026 cs.CV

RISE-Video: Can Video Generators Decode Implicit World Rules?

Xue Yang

Citations: 46

h-index: 3

Xing Sun

Citations: 112

h-index: 5

Mingxin Liu

Citations: 33

h-index: 3

Shuran Ma

Citations: 7

h-index: 2

Shibei Meng

Citations: 105

h-index: 3

Xiangyu Zhao

Citations: 109

h-index: 2

Shaofeng Zhang

Citations: 109

h-index: 2

Zhihang Zhong

Citations: 136

h-index: 6

Pei-Pei Chen

Citations: 6

h-index: 1

Haoyu Cao

Citations: 73

h-index: 3

Haodong Duan

Citations: 66

h-index: 5

Zicheng Zhang

Citations: 4,424

h-index: 33

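Although generative video models have achieved remarkable visual fidelity, their ability to internalize and reason over implicit world rules remains an important and under-explored challenge. To close this gap, this paper proposes RISE-Video, a new reasoning-oriented benchmark for Text-Image-to-Video (TI2V) generation. RISE-Video shifts the focus from surface-level aesthetics to evaluating deeper cognitive reasoning. It consists of 467 carefully human-annotated samples across eight categories, providing a structured testbed for probing model intelligence along dimensions ranging from commonsense and spatial dynamics to specialized subject domains. The work introduces a multi-dimensional evaluation protocol with four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. It also proposes an automated evaluation pipeline that uses Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering important insights for the future development of world-simulating generative models.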

Original Abstract

While generative video models have achieved remarkable visual fidelity, their capacity to internalize and reason over implicit world rules remains a critical yet under-explored frontier. To bridge this gap, we present RISE-Video, a pioneering reasoning-oriented benchmark for Text-Image-to-Video (TI2V) synthesis that shifts the evaluative focus from surface-level aesthetics to deep cognitive reasoning. RISE-Video comprises 467 meticulously human-annotated samples spanning eight rigorous categories, providing a structured testbed for probing model intelligence across diverse dimensions, ranging from commonsense and spatial dynamics to specialized subject domains. Our framework introduces a multi-dimensional evaluation protocol consisting of four metrics: Reasoning Alignment, Temporal Consistency, Physical Rationality, and Visual Quality. To further support scalable evaluation, we propose an automated pipeline leveraging Large Multimodal Models (LMMs) to emulate human-centric assessment. Extensive experiments on 11 state-of-the-art TI2V models reveal pervasive deficiencies in simulating complex scenarios under implicit constraints, offering critical insights for the advancement of future world-simulating generative models.
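
As a rough illustration of what a four-metric protocol implies for bookkeeping, the sketch below stores one score per metric for each sample and averages them per metric. Only the four metric names come from the abstract; the `SampleScore` record, the 0 to 1 score scale, the category labels, and the unweighted mean are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from statistics import mean

# Metric names follow the abstract; everything else here is hypothetical.
METRICS = (
    "reasoning_alignment",
    "temporal_consistency",
    "physical_rationality",
    "visual_quality",
)


@dataclass
class SampleScore:
    """Scores for one generated video (assumed to lie in [0, 1])."""
    category: str  # placeholder label, not the benchmark's actual category name
    reasoning_alignment: float
    temporal_consistency: float
    physical_rationality: float
    visual_quality: float


def summarize(scores):
    """Average each metric over all samples (unweighted mean is an assumption)."""
    if not scores:
        raise ValueError("no scores to summarize")
    return {m: mean(getattr(s, m) for s in scores) for m in METRICS}


# Made-up scores for two samples, purely to exercise the function.
example = [
    SampleScore("commonsense", 0.5, 1.0, 0.75, 1.0),
    SampleScore("spatial", 0.0, 1.0, 0.25, 0.5),
]
summary = summarize(example)
```

Keeping the four metrics as separate fields, rather than collapsing them into one number up front, makes it possible to see cases where a model scores well on Visual Quality but poorly on Reasoning Alignment, which is the kind of gap the abstract describes.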

1 Citation

0 Influential

16.5 Altmetric

83.5 Score
