2605.28277v1 May 27, 2026 cs.AI

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Xi Xiao
Zhangquan Chen
Chunlei Meng
Zhikai Pan
Chih-Ting Liao
Yitong Qiao
Chunrui Liu
Xinzhuo Cao

Whether large language models (LLMs) construct internal spatial world models from pure-text descriptions remains contested, and whether such capabilities transfer across languages has not been systematically studied. We introduce MentalMap, a multilingual diagnostic benchmark with a six-level capability hierarchy (L0-L5) spanning atomic spatial facts to generative world-graph construction, together with four diagnostic axes probing frame of reference, reading-direction bias, reasoning-effort allocation, and hallucination. MentalMap is built from 100 ProcTHOR household scenes, covers eight typologically diverse languages plus a structured-text control, and contains 39 task families across 1,950 evaluation cells. Evaluating thirteen LLMs across scales and model families, we identify a universal L3 reasoning cliff: no model retains even half of its L0 performance on viewpoint reasoning once baseline atomic accuracy exceeds 40%. The cliff persists across languages, scales, and prompting strategies, while structured-output failures and reasoning patterns vary substantially across models. Human evaluation under the identical pure-text protocol reproduces the same failure pattern, suggesting that the bottleneck arises from text-only working memory constraints rather than being specific to current LLM architectures. Our findings reframe pure-text spatial reasoning as a multi-axis world-modeling problem and motivate multimodal and scratchpad-augmented reasoning as future directions.
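
The cliff criterion stated in the abstract (a model with L0 atomic accuracy above 40% retains less than half of that accuracy at L3) can be read as a simple ratio test. The following is a minimal sketch of that reading, not the authors' evaluation code: the names `LevelScores`, `retention`, and `shows_cliff` are hypothetical, and defining retention as L3 accuracy divided by L0 accuracy is an assumption inferred from the wording "retains even half of its L0 performance".

```python
from dataclasses import dataclass


@dataclass
class LevelScores:
    """Accuracies in [0, 1] for one model (hypothetical container)."""
    model: str
    l0_accuracy: float  # atomic spatial facts
    l3_accuracy: float  # viewpoint reasoning


def retention(scores: LevelScores) -> float:
    """Fraction of L0 accuracy that survives at L3 (assumed definition)."""
    if scores.l0_accuracy <= 0:
        raise ValueError("L0 accuracy must be positive to compute retention")
    return scores.l3_accuracy / scores.l0_accuracy


def shows_cliff(scores: LevelScores,
                min_l0: float = 0.40,
                max_retention: float = 0.50) -> bool:
    """True when the model clears the L0 baseline yet keeps under half of it at L3."""
    return scores.l0_accuracy > min_l0 and retention(scores) < max_retention


if __name__ == "__main__":
    # Placeholder numbers for illustration only; these are NOT results from the paper.
    example = LevelScores(model="placeholder-model", l0_accuracy=0.80, l3_accuracy=0.30)
    print(example.model, retention(example), shows_cliff(example))
```

Under this reading, the 40% baseline acts as a filter: models that fail at atomic facts are excluded, so a low L3 score cannot be attributed to a weak starting point.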
