2606.06076v1 Jun 04, 2026 cs.AI

Learning Visual Spatial Planning from Symbolic State via Modality-Gap-Aware Self-Distillation

Zhizhou Zhong
Quan Shi
Haochen Luo
Xiu Li
Jiahui Liu
Ruicheng Zhang
Jiaqi Huang
Zunnan Xu
Tsinghua University
Jun Zhou

While vision-language models excel at general multimodal understanding, they still struggle with visual spatial planning. We attribute this to a perception-reasoning modality gap: visual planning requires models to infer latent state structures from pixels and then reason over the recovered structure to produce valid actions, whereas symbolic planning directly leverages explicit objects and constraints. This creates dual bottlenecks in visual state recovery and multi-step planning. To address this, we propose MGSD, a two-stage modality-gap-aware self-distillation framework. First, a cold-start grounding stage equips the visual student with reliable state representations, minimizing early perception noise. Second, a privileged teacher transfers planning capabilities via on-policy distillation, using explicit symbolic states to supervise the student's own visual rollout prefixes. Crucially, symbolic data is used strictly during training, leaving inference purely visual. Experiments on visual planning benchmarks show that MGSD consistently improves visual planning across both 4B and 8B backbones, raising the macro average by 19.3% and 18.4%, respectively. The resulting models narrow the gap to symbolic-input upper bounds, while ablations and diagnostics confirm that the improvement comes from both visual state recovery and optimal-path reasoning. These results suggest that modality-gap-aware self-distillation improves not only how models perceive actionable states, but also how they plan over the inferred structure. Code is available at https://github.com/Oranger-l/MGSD.
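The second stage described above (a privileged teacher scoring the student's own rollout) can be illustrated with a toy loss. This is only a minimal sketch under stated assumptions, not the MGSD implementation: the abstract does not specify the distillation objective, so the per-token reverse KL used here, the function names, and the four-action toy logits are all hypothetical. The idea shown is that both models are evaluated on the same student-sampled steps, with the teacher's logits assumed to come from the symbolic state and the student's from the image.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one step (the KL direction is an assumption)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_steps, teacher_steps):
    """Mean per-step reverse KL over one student-sampled rollout.

    student_steps: logits from the student (visual input) at each step.
    teacher_steps: logits from the teacher (symbolic input) at the same steps.
    """
    if not student_steps or len(student_steps) != len(teacher_steps):
        raise ValueError("rollouts must be non-empty and of equal length")
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)


# Hypothetical rollout: 2 steps over 4 actions (up, down, left, right).
student = [[2.0, 0.5, 0.1, 0.1], [0.2, 0.2, 1.5, 0.3]]
teacher = [[2.0, 0.5, 0.1, 0.1], [0.1, 0.1, 0.2, 3.0]]

loss = distillation_loss(student, teacher)
print(f"distillation loss: {loss:.4f}")
```

In this toy rollout the two models agree at the first step and disagree at the second, so the whole loss comes from the second step. In a real training loop the logits would be produced by the models themselves and the loss minimized with respect to the student's parameters only.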
