2601.19834v1 Jan 27, 2026 cs.AI

Visual Generation Unlocks Human-Like Reasoning through Multimodal World Models

Hongyi Yuan

Citations: 5,379

h-index: 18

Jialong Wu

Tsinghua University

Citations: 553

h-index: 9

Xiaoying Zhang

Citations: 7

h-index: 1

Xiangcheng Zhang

Citations: 4

h-index: 1

Chaoyi Deng

Citations: 203

h-index: 2

Renrui Zhang

Citations: 7

h-index: 1

Youbin Wu

Citations: 280

h-index: 3

Mingsheng Long

Citations: 52

h-index: 4

Tianhao Huang

Citations: 74

h-index: 4

Changjing He

Citations: 27

h-index: 2

Original Abstract

Humans construct internal world models and reason by manipulating the concepts within these models. Recent advances in AI, particularly chain-of-thought (CoT) reasoning, approximate such human cognitive abilities, where world models are believed to be embedded within large language models. Expert-level performance in formal and abstract domains such as mathematics and programming has been achieved in current systems by relying predominantly on verbal reasoning. However, they still lag far behind humans in domains like physical and spatial intelligence, which require richer representations and prior knowledge. The emergence of unified multimodal models (UMMs) capable of both verbal and visual generation has therefore sparked interest in more human-like reasoning grounded in complementary multimodal pathways, though their benefits remain unclear. From a world-model perspective, this paper presents the first principled study of when and how visual generation benefits reasoning. Our key position is the visual superiority hypothesis: for certain tasks--particularly those grounded in the physical world--visual generation more naturally serves as world models, whereas purely verbal world models encounter bottlenecks arising from representational limitations or insufficient prior knowledge. Theoretically, we formalize internal world modeling as a core component of CoT reasoning and analyze distinctions among different forms of world models. Empirically, we identify tasks that necessitate interleaved visual-verbal CoT reasoning, constructing a new evaluation suite, VisWorld-Eval. Controlled experiments on a state-of-the-art UMM show that interleaved CoT significantly outperforms purely verbal CoT on tasks that favor visual world modeling, but offers no clear advantage otherwise. Together, this work clarifies the potential of multimodal world modeling for more powerful, human-like multimodal AI.

0 Citations

0 Influential

9 Altmetric

45.0 Score
