2604.13029v1 Apr 14, 2026 cs.CV

Visual Preference Optimization with Rubric Rewards

Yong Liao (Citations: 6; h-index: 1)
Haoyu Ren (Citations: 85; h-index: 6)
Ya-Qi Yu (Citations: 68; h-index: 3)
Fang Hong (Citations: 0; h-index: 0)
Xiangyan Qu (Citations: 27; h-index: 2)
Gaojie Wu (Citations: 233; h-index: 4)
N. Xu (Citations: 156; h-index: 2)
Huixin Wang (Citations: 9; h-index: 2)
Wuheng Xu (Citations: 4; h-index: 1)
Haonan Li (Citations: 99; h-index: 4)
Dezhi Peng (Citations: 1,252; h-index: 17)
Minghui Liao (Citations: 224; h-index: 5)
Jihao Wu (Citations: 251; h-index: 6)
Dandan Tu (Citations: 45; h-index: 2)
Hao Wang (Citations: 7; h-index: 2)
Ziming Li (Citations: 414; h-index: 11)
Qiaoyu Luo (Citations: 0; h-index: 0)
Zihao Chen (Citations: 35; h-index: 3)

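The effectiveness of Direct Preference Optimization (DPO) on multimodal tasks depends heavily on preference data that reflect meaningful quality differences. Existing pipelines often rely on off-policy perturbations or on coarse outcome-based signals, neither of which suits fine-grained visual reasoning. This work proposes rDPO, a preference optimization framework built on instance-specific rubrics. For each image-instruction pair, it generates a checklist-style rubric of essential and additional criteria for scoring responses from any policy. The instruction-rubric pool is built offline and reused during on-policy data generation. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge model, bringing it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering lowers it from 81.14 to 75.82. In a scalability evaluation on a comprehensive benchmark, rDPO reaches 61.01, well above the style-constrained baseline (52.36) and ahead of the base model (59.48). These results indicate that visual preference optimization works best when on-policy data construction is combined with instance-specific, criterion-level feedback.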

Original Abstract

The effectiveness of Direct Preference Optimization (DPO) depends on preference data that reflect the quality differences that matter in multimodal tasks. Existing pipelines often rely on off-policy perturbations or coarse outcome-based signals, which are not well suited to fine-grained visual reasoning. We propose rDPO, a preference optimization framework based on instance-specific rubrics. For each image-instruction pair, we create a checklist-style rubric of essential and additional criteria to score responses from any policy. The instruction-rubric pool is built offline and reused during the construction of on-policy data. On public reward modeling benchmarks, rubric-based prompting substantially improves a 30B-A3B judge and brings it close to GPT-5.4. On public downstream benchmarks, rubric-based filtering raises the macro average to 82.69, whereas outcome-based filtering drops it from 81.14 to 75.82. When evaluating scalability on a comprehensive benchmark, rDPO achieves 61.01, markedly outperforming the style-constrained baseline (52.36) and surpassing the base model (59.48). Together, these results show that visual preference optimization benefits from combining on-policy data construction with instance-specific criterion-level feedback.
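The abstract describes scoring responses against a checklist of essential and additional criteria and then building preference pairs from on-policy samples, but it does not say how criteria are aggregated or how pairs are chosen. The following is only an illustrative sketch under stated assumptions: the names `Criterion`, `rubric_score`, and `select_pair`, the 2:1 weighting of essential over additional criteria, and the minimum score margin are all hypothetical and are not taken from the paper. Judge verdicts are assumed to arrive as one boolean per criterion.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Criterion:
    """One checklist item of an instance-specific rubric (hypothetical type)."""
    text: str
    essential: bool  # True for an essential criterion, False for an additional one


def rubric_score(
    verdicts: Sequence[bool],
    rubric: Sequence[Criterion],
    essential_weight: float = 2.0,   # assumed weighting, not from the paper
    additional_weight: float = 1.0,  # assumed weighting, not from the paper
) -> float:
    """Return the weighted fraction of satisfied criteria, in [0, 1]."""
    if not rubric:
        raise ValueError("rubric must contain at least one criterion")
    if len(verdicts) != len(rubric):
        raise ValueError("need exactly one verdict per criterion")
    weights = [essential_weight if c.essential else additional_weight for c in rubric]
    earned = sum(w for w, ok in zip(weights, verdicts) if ok)
    return earned / sum(weights)


def select_pair(
    scores: Sequence[float],
    min_margin: float = 0.2,  # assumed filter threshold, not from the paper
) -> Optional[Tuple[int, int]]:
    """Pick (chosen, rejected) indices among sampled responses.

    Returns None when the best and worst scores are too close to give a
    clear preference signal, so the instance is filtered out.
    """
    if len(scores) < 2:
        return None
    best = max(range(len(scores)), key=lambda i: scores[i])
    worst = min(range(len(scores)), key=lambda i: scores[i])
    if scores[best] - scores[worst] < min_margin:
        return None
    return best, worst


# Example rubric for one image-instruction pair (contents are made up).
rubric = [
    Criterion("Reads the value on the chart's y-axis correctly", essential=True),
    Criterion("States the final answer with units", essential=True),
    Criterion("Mentions the chart title", essential=False),
]

# Per-criterion verdicts that a judge model might return for three sampled responses.
verdicts_per_response = [
    [True, True, False],
    [True, False, True],
    [False, False, False],
]
scores = [rubric_score(v, rubric) for v in verdicts_per_response]
pair = select_pair(scores)
```

In this sketch the pair `(chosen, rejected)` would feed a standard DPO training step; filtering on a score margin is one simple way to discard instances where the rubric does not separate the sampled responses.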
