2602.11146v1 Feb 11, 2026 cs.CV

Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

Gongye Liu, Bo Yang, Yida Zhi, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo, Zhizhou Zhong

Abstract

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computation and memory cost can be substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. In preference optimization, we demonstrate that DiNa-LRM improves preference optimization dynamics, enabling faster and more resource-efficient model alignment.
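The abstract names two mechanisms without giving formulas: a Thurstone likelihood whose uncertainty depends on the diffusion noise level, and noise ensembling at inference time. The sketch below illustrates one plausible reading and is not the authors' implementation. Everything specific is an assumption made for illustration: the additive variance form (`base_sigma**2 + sigma_win**2 + sigma_lose**2`), the flow-matching-style noising `x_t = (1 - sigma) * x + sigma * eps`, and the hypothetical names `thurstone_nll`, `ensemble_reward`, and `toy_reward`.

```python
import math
import random
from typing import Callable, Sequence


def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def thurstone_nll(r_win: float, r_lose: float,
                  sigma_win: float, sigma_lose: float,
                  base_sigma: float = 1.0, eps: float = 1e-12) -> float:
    """Negative log-likelihood that the preferred sample wins.

    Classic Thurstone model: P(win) = Phi((r_win - r_lose) / scale).
    ASSUMPTION: the scale grows with the noise level of each input, so
    comparisons made on noisier states are treated as less certain.
    """
    scale = math.sqrt(base_sigma ** 2 + sigma_win ** 2 + sigma_lose ** 2)
    p_win = normal_cdf((r_win - r_lose) / scale)
    return -math.log(max(p_win, eps))


def ensemble_reward(reward_fn: Callable[[Sequence[float], float], float],
                    latent: Sequence[float],
                    sigmas: Sequence[float],
                    n_draws: int,
                    rng: random.Random) -> float:
    """Average a noise-conditioned reward over several noised copies of a latent.

    ASSUMPTION: noising follows the interpolation
    x_t = (1 - sigma) * x + sigma * eps with eps ~ N(0, 1).
    """
    scores = []
    for sigma in sigmas:
        for _ in range(n_draws):
            noisy = [(1.0 - sigma) * x + sigma * rng.gauss(0.0, 1.0)
                     for x in latent]
            scores.append(reward_fn(noisy, sigma))
    return sum(scores) / len(scores)


def toy_reward(noisy_latent: Sequence[float], sigma: float) -> float:
    """Hypothetical stand-in for a backbone plus timestep-conditioned head."""
    return sum(noisy_latent) / len(noisy_latent)


if __name__ == "__main__":
    rng = random.Random(0)
    # Same reward gap, but a higher noise level yields a weaker preference signal.
    print(thurstone_nll(1.0, 0.0, 0.0, 0.0))
    print(thurstone_nll(1.0, 0.0, 2.0, 2.0))
    # More draws and noise levels trade compute for a steadier reward estimate.
    print(ensemble_reward(toy_reward, [1.0, 2.0, 3.0], [0.1, 0.3, 0.5], 8, rng))
```

In this reading, enlarging `sigmas` or `n_draws` is the test-time scaling knob: each extra draw costs one more forward pass of the reward model and reduces the variance of the averaged score.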
