2602.08025v2 Feb 08, 2026 cs.CV

MIND: Benchmarking Memory Consistency and Action Control in World Models

Alex Wang (Citations: 120; h-index: 5)
Yuxin Jiang (Citations: 49; h-index: 4)
Yuchao Gu (Citations: 1,578; h-index: 11)
Yixuan Ye (Citations: 11; h-index: 2)
Xuanyu Lu (Citations: 9; h-index: 2)
Qiwei Liang (Citations: 177; h-index: 3)
Jiachun Pan (Citations: 219; h-index: 4)
Fengda Zhang (Citations: 30; h-index: 3)
Weijia Wu (Citations: 6; h-index: 2)
Rui Zhao (Citations: 465; h-index: 7)

Abstract

World models aim to understand, remember, and predict dynamic visual environments, yet a unified benchmark for evaluating their fundamental abilities remains lacking. To address this gap, we introduce MIND, the first open-domain closed-loop revisited benchmark for evaluating Memory consIstency and action coNtrol in worlD models. MIND contains 250 high-quality videos at 1080p and 24 FPS, including 100 (first-person) + 100 (third-person) video clips under a shared action space and 25 + 25 clips across varied action spaces covering eight diverse scenes. We design an efficient evaluation framework to measure two core abilities: memory consistency and action control, capturing temporal stability and contextual coherence across viewpoints. Furthermore, we design various action spaces, including different character movement speeds and camera rotation angles, to evaluate the action generalization capability across different action spaces under shared scenes. To facilitate future performance benchmarking on MIND, we introduce MIND-World, a novel interactive Video-to-World baseline. Extensive experiments demonstrate the completeness of MIND and reveal key challenges in current world models, including the difficulty of maintaining long-term memory consistency and generalizing across action spaces. Code: https://github.com/CSU-JPG/MIND.
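
The clip counts stated in the abstract can be tallied with a small Python sketch. The `Split` class, the `SPLITS` list, and the `group-a` / `group-b` labels are hypothetical names introduced here for illustration only. The abstract does not say how the 25 + 25 varied-action-space clips are divided, and nothing below reflects the actual layout of the MIND repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """One group of clips as described in the abstract (illustrative only)."""
    action_space: str  # "shared" or "varied"
    group: str         # viewpoint label, or a placeholder where the abstract is silent
    num_clips: int


# Counts taken from the abstract. The "group-a"/"group-b" labels are placeholders:
# the abstract gives "25 + 25" without naming the two groups.
SPLITS = [
    Split("shared", "first-person", 100),
    Split("shared", "third-person", 100),
    Split("varied", "group-a", 25),
    Split("varied", "group-b", 25),
]


def total_clips(splits):
    """Sum the clip counts over all splits."""
    return sum(s.num_clips for s in splits)


def clips_by_action_space(splits):
    """Aggregate clip counts per action-space type."""
    totals = {}
    for s in splits:
        totals[s.action_space] = totals.get(s.action_space, 0) + s.num_clips
    return totals


if __name__ == "__main__":
    print(total_clips(SPLITS))
    print(clips_by_action_space(SPLITS))
```

The tally confirms that the four groups account for the 250 videos the abstract reports, with 200 clips under the shared action space and 50 under varied action spaces.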

