2605.30244v1 May 28, 2026 cs.CV

Reinforcement Learning with Robust Rubric Rewards

Dandan Tu

Citations: 50

h-index: 3

Yong Liao

Citations: 9

h-index: 2

Ya-Qi Yu

Citations: 75

h-index: 3

Fang Hong

Citations: 2

h-index: 1

Xiangyan Qu

Citations: 35

h-index: 3

Gaojie Wu

Citations: 250

h-index: 5

Nuo Xu

Citations: 168

h-index: 2

Huixin Wang

Citations: 11

h-index: 2

Wuheng Xu

Citations: 6

h-index: 2

Haonan Li

Citations: 112

h-index: 4

Dezhi Peng

Citations: 1,324

h-index: 18

Minghui Liao

Citations: 245

h-index: 5

Jihao Wu

Citations: 282

h-index: 6

Hao Wang

Citations: 11

h-index: 2

Ziming Li

Citations: 491

h-index: 13

Qiaoyu Luo

Citations: 2

h-index: 1

Hao Ren

Citations: 72

h-index: 5

Zihao Chen

Citations: 2

h-index: 1

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are only partially verifiable and demand multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on how accurately they are executed during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), which extends RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through one of two execution paths: an LLM-as-an-extractor paired with a deterministic verifier for verifiable criteria, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm that deterministic verification and minimal exposure significantly reduce exploitable false positives.
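The routing and aggregation ideas in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the `Criterion` fields, the exact-match verifier, the gating rule (additional criteria count only when every essential criterion passes), and `bonus_weight` are all assumptions made for illustration, since the abstract does not specify them. The `extractor` and `judge` arguments stand in for LLM calls and are hypothetical callables.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Criterion:
    name: str
    essential: bool       # assumed flag: essential vs. additional criterion
    verifiable: bool      # assumed flag: selects the execution path
    reference: str = ""   # ground truth, seen only by the deterministic verifier


def score_criterion(
    c: Criterion,
    response: str,
    extractor: Callable[[str, str], str],
    judge: Callable[[str, str], float],
) -> float:
    """Route one criterion through one of the two execution paths."""
    if c.verifiable:
        # Minimal exposure: the extractor receives the response only,
        # never the reference answer.
        extracted = extractor(c.name, response)
        # Deterministic verifier (assumed here to be a normalized exact match).
        return 1.0 if extracted.strip().lower() == c.reference.strip().lower() else 0.0
    # Minimal exposure: the judge receives text only, no image.
    return float(judge(c.name, response))


def aggregate(criteria: List[Criterion], scores: List[float],
              bonus_weight: float = 0.5) -> float:
    """Hierarchical aggregation (assumed rule): additional criteria
    contribute only when all essential criteria are fully satisfied."""
    ess = [s for c, s in zip(criteria, scores) if c.essential]
    add = [s for c, s in zip(criteria, scores) if not c.essential]
    ess_score = sum(ess) / len(ess) if ess else 1.0
    add_score = sum(add) / len(add) if add else 0.0
    gate = 1.0 if ess_score == 1.0 else 0.0
    return (ess_score + bonus_weight * gate * add_score) / (1.0 + bonus_weight)


def reward(criteria, response, extractor, judge) -> float:
    """Score every criterion, then combine the scores into one reward in [0, 1]."""
    scores = [score_criterion(c, response, extractor, judge) for c in criteria]
    return aggregate(criteria, scores)
```

Under these assumptions, a response that fails an essential verifiable criterion receives no credit from additional criteria, regardless of how the judge rates them. That is one simple way to prioritize essential criteria; the paper's actual aggregation and its handling of score saturation within rollout groups may differ.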

