2603.09731v1 Mar 10, 2026 cs.CV


EXPLORE-Bench: Egocentric Scene Prediction with Long-Horizon Reasoning

Chengjun Yu (Citations: 41, h-index: 3)
Xuhan Zhu (Citations: 19, h-index: 3)
Chaoqun Du (Citations: 12, h-index: 2)
Pengfei Yu (Citations: 19, h-index: 3)
Wei Zhai (Citations: 1,720, h-index: 25)
Zhengjun Zha (Citations: 461, h-index: 13)
Yang Cao (Citations: 36, h-index: 3)

Abstract

Multimodal large language models (MLLMs) are increasingly considered as a foundation for embodied agents, yet it remains unclear whether they can reliably reason about the long-term physical consequences of actions from an egocentric viewpoint. We study this gap through a new task, Egocentric Scene Prediction with LOng-horizon REasoning: given an initial-scene image and a sequence of atomic action descriptions, a model is asked to predict the final scene after all actions are executed. To enable systematic evaluation, we introduce EXPLORE-Bench, a benchmark curated from real first-person videos spanning diverse scenarios. Each instance pairs long action sequences with structured final-scene annotations, including object categories, visual attributes, and inter-object relations, which supports fine-grained, quantitative assessment. Experiments on a range of proprietary and open-source MLLMs reveal a significant performance gap to humans, indicating that long-horizon egocentric reasoning remains a major challenge. We further analyze test-time scaling via stepwise reasoning and show that decomposing long action sequences can improve performance to some extent, while incurring non-trivial computational overhead. Overall, EXPLORE-Bench provides a principled testbed for measuring and advancing long-horizon reasoning for egocentric embodied perception.
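To make the task setup more concrete, here is a minimal Python sketch of how a structured final-scene annotation (object categories, visual attributes, inter-object relations) might be compared against a model's prediction, and how a long action sequence might be split for stepwise reasoning. Everything in it is an assumption for illustration: the `SceneAnnotation` fields, the set-overlap F1 scoring, and the `chunk_actions` helper are hypothetical and are not taken from the paper, which does not specify its schema or metrics in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SceneAnnotation:
    """Hypothetical structured description of a final scene."""
    objects: frozenset = frozenset()     # object categories, e.g. {"cup", "table"}
    attributes: frozenset = frozenset()  # (object, attribute) pairs
    relations: frozenset = frozenset()   # (subject, predicate, object) triples


def set_f1(predicted, gold):
    """F1 overlap between two sets (an assumed metric, not the paper's)."""
    if not predicted and not gold:
        return 1.0  # nothing to predict and nothing predicted
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def score_scene(predicted, gold):
    """Score each annotation type separately for a fine-grained comparison."""
    return {
        "objects": set_f1(predicted.objects, gold.objects),
        "attributes": set_f1(predicted.attributes, gold.attributes),
        "relations": set_f1(predicted.relations, gold.relations),
    }


def chunk_actions(actions, chunk_size):
    """Split a long action sequence into consecutive chunks, so a model
    could be queried once per chunk instead of once for the whole sequence."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [actions[i:i + chunk_size] for i in range(0, len(actions), chunk_size)]


if __name__ == "__main__":
    gold = SceneAnnotation(
        objects=frozenset({"cup", "table"}),
        attributes=frozenset({("cup", "red")}),
        relations=frozenset({("cup", "on", "table")}),
    )
    predicted = SceneAnnotation(
        objects=frozenset({"cup", "table", "knife"}),
        attributes=frozenset({("cup", "red")}),
        relations=frozenset({("cup", "in", "sink")}),
    )
    print(score_scene(predicted, gold))
    print(chunk_actions(["pick up cup", "walk to sink", "rinse cup"], 2))
```

Scoring the three annotation types separately mirrors the fine-grained assessment the abstract describes. Chunking also illustrates the overhead the abstract mentions: a smaller `chunk_size` means more model calls per instance.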

