2602.13294v2 Feb 09, 2026 cs.CV

VisPhyWorld: Probing Physical Reasoning via Code-Driven Video Reconstruction

Jiarong Liang

Max W.F. Ku

Ka-Hei Hui

Ping Nie

Wenhu Chen

Abstract

Evaluating whether Multimodal Large Language Models (MLLMs) genuinely reason about physical dynamics remains challenging. Most existing benchmarks rely on recognition-style protocols such as Visual Question Answering (VQA) and Violation of Expectation (VoE), which can often be answered without committing to an explicit, testable physical hypothesis. We propose VisPhyWorld, an execution-based framework that evaluates physical reasoning by requiring models to generate executable simulator code from visual observations. Because the model must produce runnable code, the inferred world representation is directly inspectable, editable, and falsifiable, and physical reasoning is separated from rendering. Building on this framework, we introduce VisPhyBench, which comprises 209 evaluation scenes derived from 108 physical templates and a systematic protocol that evaluates how well models reconstruct appearance and reproduce physically plausible motion. Our pipeline produces valid reconstructed videos in 97.7% of cases on the benchmark. Experiments show that while state-of-the-art MLLMs achieve strong semantic scene understanding, they struggle to infer physical parameters accurately and to simulate consistent physical dynamics.
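To make the idea of an inspectable, editable, and falsifiable world representation concrete, here is a minimal sketch in Python. It is not the authors' pipeline or simulator: the `BallHypothesis` class, the `simulate` and `mean_abs_error` helpers, and the synthetic "observed" trajectory are all hypothetical names and data chosen for illustration. The sketch assumes a single ball in free fall and shows how an explicit physical hypothesis, once written as runnable code, can be rolled forward and checked against observed motion.

```python
from dataclasses import dataclass


@dataclass
class BallHypothesis:
    """Hypothetical inferred world state for one falling ball (illustrative only)."""
    y0: float       # initial height in metres
    vy0: float      # initial vertical velocity in m/s
    gravity: float  # downward acceleration in m/s^2


def simulate(h: BallHypothesis, dt: float, steps: int) -> list[float]:
    """Roll the hypothesis forward with semi-implicit Euler and return the heights."""
    y, vy = h.y0, h.vy0
    heights = [y]
    for _ in range(steps):
        vy -= h.gravity * dt  # update velocity first
        y += vy * dt          # then position, using the new velocity
        heights.append(y)
    return heights


def mean_abs_error(predicted: list[float], observed: list[float]) -> float:
    """Average per-frame height error between a simulated and an observed track."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


# Stand-in for object heights tracked from a video; synthetic in this sketch.
observed = simulate(BallHypothesis(y0=10.0, vy0=0.0, gravity=9.81), dt=0.05, steps=20)

# Two candidate hypotheses that differ only in the inferred gravity parameter.
good = BallHypothesis(y0=10.0, vy0=0.0, gravity=9.81)
bad = BallHypothesis(y0=10.0, vy0=0.0, gravity=1.62)

err_good = mean_abs_error(simulate(good, 0.05, 20), observed)
err_bad = mean_abs_error(simulate(bad, 0.05, 20), observed)
print(f"error with gravity=9.81: {err_good:.3f}")
print(f"error with gravity=1.62: {err_bad:.3f}")
```

Because each parameter is a named field, a wrong guess such as the low gravity value can be read, edited, and re-run directly. The mismatch with the observed motion then counts as evidence against that hypothesis, independent of how the scene is drawn.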
