2605.29360v1 May 28, 2026 cs.AI

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

Jiayi Zhou
Juntao Dai
Jiawei Chen
Tianzhuo Yang
Jiaming Ji
Yaodong Yang
Zirui Mi
Zhaoyi Zhang
Boyuan Chen
Zihan Shen

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.
