2606.12886v1 Jun 11, 2026 cs.CV

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Conghui He
Conghui He
Citations: 394
h-index: 10
Jingxuan Wei
Jingxuan Wei
Citations: 424
h-index: 12
Siyuan Li
Siyuan Li
Citations: 14
h-index: 2
Xinglong Xu
Xinglong Xu
Citations: 29
h-index: 3
Cheng Tan
Cheng Tan
Citations: 10
h-index: 2
Yujun Wu
Yujun Wu
Citations: 17
h-index: 3
Tingyun Li
Tingyun Li
Citations: 63
h-index: 5
Leran Zhou
Leran Zhou
Citations: 0
h-index: 0

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Tiransition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

0 Citations
0 Influential
6 Altmetric
30.0 Score
Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

Log in to request an AI analysis.

댓글

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!