2606.05677v1 Jun 04, 2026 cs.CV

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

Honggang Zhang
Longteng Guo
Shiqiang Lang
Peiwen Sun
Jing Liu
Haoyang He
Yuanteng Chen
Tao Liu
Lan Yang

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. However, long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view: models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory that covers scene perception, spatial relations, and spatial memory. We further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating that explicit spatial memory is a key capability for long-horizon video MLLMs.
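
The chunk-then-retrieve idea mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything below is an assumption made for illustration. Frames are stood in for by small feature vectors, each chunk is summarized by a simple mean, and retrieval ranks chunks by cosine similarity to a question vector. The 3D structural cues and the layer-aware aspect of the memory are omitted entirely, and the names `split_into_chunks`, `build_memory`, and `retrieve` are hypothetical.

```python
import math
from typing import List, Sequence


def split_into_chunks(frames: Sequence, chunk_size: int) -> List[list]:
    """Split a frame sequence into consecutive, non-overlapping chunks."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [list(frames[i:i + chunk_size]) for i in range(0, len(frames), chunk_size)]


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors (assumed chunk summary)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_memory(frame_features: Sequence[Sequence[float]], chunk_size: int) -> List[List[float]]:
    """Store one summary vector per chunk, in temporal order."""
    return [mean_vector(chunk) for chunk in split_into_chunks(frame_features, chunk_size)]


def retrieve(memory: Sequence[Sequence[float]], question: Sequence[float], top_k: int) -> List[int]:
    """Return indices of the top_k chunks most similar to the question vector."""
    scored = [(cosine(entry, question), idx) for idx, entry in enumerate(memory)]
    # Sort by descending similarity; break ties by earlier chunk index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:top_k]]


# Toy data: six 2-D "frame features", grouped into three chunks of two frames.
frame_features = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 1]]
memory = build_memory(frame_features, chunk_size=2)
print(retrieve(memory, question=[0, 1], top_k=2))  # [1, 2]
```

In this toy setting the question vector points along the second axis, so the second chunk (index 1) ranks first and the mixed third chunk (index 2) ranks second. A real system would replace the mean summaries and the hand-made question vector with learned representations.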

