2605.26038v1 May 25, 2026 cs.CV

DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Jianze Li
Ziqing Zhang
Xinrui Shi
Kai Liu
Anqi Li
Yulun Zhang

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, leaving lightweight models free to generate fluent but visually unanchored reasoning chains. To address this gap, we first introduce DRBench, a benchmark of 14,573 questions across 2,943 images, organized into five task categories spanning three progressive reasoning layers. Building on DRBench, we propose DRScaffold, a supervised fine-tuning framework that decomposes the supervision target into four causally ordered stages, enforcing grounded reasoning without architectural modification. Experiments on three lightweight VLMs demonstrate substantial gains on DRBench while preserving or improving performance on general-purpose benchmarks. Notably, Qwen2.5-VL-3B trained with DRScaffold surpasses the frozen Qwen2.5-VL-32B on DRBench, demonstrating that structured supervision can substitute for a significant portion of model scale in dense-scene reasoning. Our code and models are available at https://github.com/irene-shi/DRScaffold .
