2605.28023v1 May 27, 2026 cs.CV

VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning

Yankai Yang
Yancheng Long
Tianke Zhang
Kaiyu Jiang
Haonan Fan
Changyi Liu
Tingting Gao
Xingyu Lu
Jinpeng Wang
Xuanyu Zheng
Bin Wen
Hanqi Li
Yiyang Fan
Chun Yuan
Yi-Fan Zhang
Fan Yang

Visual captioning requires models to capture visual content faithfully while minimizing both omission and hallucination. As the dominant paradigm for captioning, multimodal large language models (MLLMs) have achieved strong performance through scaling and high-quality data. Recently, reinforcement learning (RL) has emerged as a key route to driving MLLMs toward higher precision and broader coverage. However, existing reward designs for captioning fail to provide fine-grained and reliable signals for factual verification, which limits their effectiveness. To address this, we propose VCap, a Witness-Adjudicator reward that pairs the reference caption (a witness) with the visual signal (an adjudicator). By explicitly verifying factual consistency between the reference and policy-generated captions, grounded in the visual signal, VCap delivers a reward signal with hypergeometric-distribution-level precision for caption quality verification. This design enables effective learning even from imperfect references, facilitating weak-to-strong generalization in RL training. In our experiments, an 8B model trained with VCap outperforms open- and closed-source SOTA models on multiple image and video captioning benchmarks. Human evaluation further confirms its strong alignment with factual correctness. Additionally, VCap improves MLLM perceptual capability, generalizes across tasks, and surpasses best-of-N distillation, challenging prior assumptions about RL with verifiable rewards (RLVR).
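
As background on the term used above: the hypergeometric distribution gives the probability of drawing exactly k "successes" in n draws without replacement from a population of N items, K of which are successes. The abstract does not state how VCap's reward uses this distribution, so the sketch below is only the textbook probability mass function. The function name `hypergeom_pmf` and the fact-counting framing in the comments are hypothetical illustrations, not the authors' method.

```python
from math import comb


def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """Textbook hypergeometric PMF, P(X = k).

    N: population size, K: successes in the population,
    n: number of draws without replacement, k: observed successes.
    """
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


# Hypothetical framing (an assumption, not from the paper):
# a pool of N = 10 candidate facts, K = 7 of them correct,
# and n = 4 facts sampled for checking.
p_all_correct = hypergeom_pmf(4, N=10, K=7, n=4)  # 35 / 210 = 1/6
```

With these illustrative numbers, the chance that all four sampled facts are correct is 1/6, and the probabilities over k = 0..4 sum to 1.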
