2602.08236v1 Feb 09, 2026 cs.CV


When and How Much to Imagine: Adaptive Test-Time Scaling with World Models for Visual Spatial Reasoning

Mingyu Ding (Citations: 290; h-index: 8)

Shoubin Yu (Citations: 1,280; h-index: 12)

Yue Zhang (Citations: 45; h-index: 4)

Zun Wang (Citations: 266; h-index: 6)

Jaehong Yoon (Citations: 271; h-index: 6)

Huaxiu Yao (Citations: 443; h-index: 6)

Mohit Bansal (Citations: 30; h-index: 4)

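Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when the correct answer depends on how a scene would look from an unseen or alternative viewpoint. Recent work has tried to address this by using world models for visual imagination, but it is still poorly understood when imagination is actually needed, how much of it helps, and when it becomes harmful. In practice, indiscriminate imagination increases computation and can even degrade performance by introducing misleading evidence. This work presents an in-depth analysis of test-time visual imagination as a controllable resource for visual spatial reasoning. It studies when static visual evidence alone is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, it introduces AVIC, an adaptive test-time framework that explicitly judges whether the current visual evidence is sufficient and then selectively invokes and scales visual imagination. Results on visual spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R) show clear cases where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies while substantially reducing the number of world-model calls and language tokens. Overall, the findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable visual spatial reasoning.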

Original Abstract

Despite rapid progress in Multimodal Large Language Models (MLLMs), visual spatial reasoning remains unreliable when correct answers depend on how a scene would appear under unseen or alternative viewpoints. Recent work addresses this by augmenting reasoning with world models for visual imagination, but questions such as when imagination is actually necessary, how much of it is beneficial, and when it becomes harmful, remain poorly understood. In practice, indiscriminate imagination can increase computation and even degrade performance by introducing misleading evidence. In this work, we present an in-depth analysis of test-time visual imagination as a controllable resource for spatial reasoning. We study when static visual evidence is sufficient, when imagination improves reasoning, and how excessive or unnecessary imagination affects accuracy and efficiency. To support this analysis, we introduce AVIC, an adaptive test-time framework with world models that explicitly reasons about the sufficiency of current visual evidence before selectively invoking and scaling visual imagination. Across spatial reasoning benchmarks (SAT, MMSI) and an embodied navigation benchmark (R2R), our results reveal clear scenarios where imagination is critical, marginal, or detrimental, and show that selective control can match or outperform fixed imagination strategies with substantially fewer world-model calls and language tokens. Overall, our findings highlight the importance of analyzing and controlling test-time imagination for efficient and reliable spatial reasoning.
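The control flow described in the abstract (judge whether the current visual evidence is sufficient, and only then invoke and scale imagination) can be illustrated with a minimal sketch. This is not the AVIC implementation: the function name, the `is_sufficient`, `imagine`, and `answer` callables, the `max_calls` budget, and the toy stubs are all hypothetical stand-ins for an MLLM sufficiency check, a world-model call, and an MLLM answer step.

```python
from dataclasses import dataclass
from typing import Callable, List

# All names below are hypothetical; none are taken from the paper.
View = str  # stand-in for an image or a rendered viewpoint


@dataclass
class Result:
    answer: str
    world_model_calls: int


def answer_with_adaptive_imagination(
    question: str,
    views: List[View],
    is_sufficient: Callable[[str, List[View]], bool],
    imagine: Callable[[str, List[View]], View],
    answer: Callable[[str, List[View]], str],
    max_calls: int = 3,
) -> Result:
    evidence = list(views)  # copy so the caller's list is not mutated
    calls = 0
    # Gate: invoke the world model only while the evidence is judged
    # insufficient, and never exceed the imagination budget.
    while calls < max_calls and not is_sufficient(question, evidence):
        evidence.append(imagine(question, evidence))
        calls += 1
    return Result(answer(question, evidence), calls)


# Toy stubs so the sketch runs end to end.
def toy_is_sufficient(question: str, evidence: List[View]) -> bool:
    return len(evidence) >= 2  # pretend two views are always enough


def toy_imagine(question: str, evidence: List[View]) -> View:
    return f"imagined_view_{len(evidence)}"


def toy_answer(question: str, evidence: List[View]) -> str:
    return f"answered from {len(evidence)} view(s)"


result = answer_with_adaptive_imagination(
    "What is to the left of the chair after turning around?",
    ["front_view"],
    toy_is_sufficient,
    toy_imagine,
    toy_answer,
)
print(result)
```

With these stubs, a question that already comes with two views triggers no world-model call, while a single-view question triggers exactly one. A fixed imagination strategy would correspond to ignoring the sufficiency check and always spending the full budget.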

6 Citations

1 Influential

6 Altmetric

38.0 Score
