2605.30244v1 May 28, 2026 cs.CV

Reinforcement Learning with Robust Rubric Rewards

Dandan Tu
Yong Liao
Ya-Qi Yu
Fang Hong
Xiangyan Qu
Gaojie Wu
Nuo Xu
Huixin Wang
Wuheng Xu
Haonan Li
Dezhi Peng
Minghui Liao
Jihao Wu
Hao Wang
Ziming Li
Qiaoyu Luo
Hao Ren
Zihao Chen

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable and demand multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), which extends RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through one of two execution paths: an LLM-as-an-Extractor paired with a deterministic verifier for verifiable criteria, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria and to mitigate score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the gap between the official instruct and thinking models. Controlled audits confirm that deterministic verification and minimal exposure significantly reduce exploitable false positives.
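The criterion-level routing described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Criterion` class, the `extract` and `judge` callables (stand-ins for the LLM extractor and LLM judge), the exact-match verifier, and the `bonus_weight` aggregation rule are all hypothetical choices made for illustration, since the abstract does not specify them. The sketch only shows the general shape: verifiable criteria go through an extractor that never sees the ground truth, followed by a deterministic check; non-verifiable criteria go to a judge that sees only text; and additional criteria contribute to the reward only once every essential criterion passes. The mitigation of score saturation within rollout groups is not covered.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    text: str
    essential: bool
    ground_truth: Optional[str] = None  # present => deterministically verifiable


def score_criterion(
    c: Criterion,
    response: str,
    extract: Callable[[str, str], str],
    judge: Callable[[str, str], bool],
) -> bool:
    if c.ground_truth is not None:
        # Verifiable path: the extractor receives the criterion and the
        # response only; the ground truth is withheld (minimal exposure).
        extracted = extract(c.text, response)
        # Deterministic verifier (assumed here to be a normalized exact match).
        return extracted.strip().lower() == c.ground_truth.strip().lower()
    # Non-verifiable path: the judge receives text only, never the image.
    return judge(c.text, response)


def rubric_reward(
    criteria: List[Criterion],
    response: str,
    extract: Callable[[str, str], str],
    judge: Callable[[str, str], bool],
    bonus_weight: float = 0.5,  # assumed weight for additional criteria
) -> float:
    results = [(c, score_criterion(c, response, extract, judge)) for c in criteria]
    essential = [ok for c, ok in results if c.essential]
    additional = [ok for c, ok in results if not c.essential]
    ess = sum(essential) / len(essential) if essential else 1.0
    add = sum(additional) / len(additional) if additional else 1.0
    # Hierarchical aggregation (assumed rule): additional criteria only
    # count once all essential criteria are satisfied.
    bonus = bonus_weight * add if ess == 1.0 else 0.0
    return (1.0 - bonus_weight) * ess + bonus


# Toy stand-ins for the LLM calls, for demonstration only.
def toy_extract(criterion_text: str, response: str) -> str:
    return response.split("Answer:")[-1]


def toy_judge(criterion_text: str, response: str) -> bool:
    return len(response) < 100


rubric = [
    Criterion("The final answer states the number of apples.", True, "7"),
    Criterion("The explanation is concise.", False),
]
good = rubric_reward(rubric, "I count the apples. Answer: 7", toy_extract, toy_judge)
bad = rubric_reward(rubric, "I count the apples. Answer: 8", toy_extract, toy_judge)
```

Under these assumptions, a response that fails the essential criterion gets no credit for being concise, which is one simple way to keep additional criteria from dominating the reward.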
