2603.12893v1 Mar 13, 2026 cs.CV

Finite Difference Flow Optimization for RL Post-Training of Text-to-Image Models

Angjoo Kanazawa (Citations: 23,895; h-index: 56)

David McAllister (Citations: 166; h-index: 4)

M. Aittala (Citations: 21,155; h-index: 29)

T. Karras (Citations: 902; h-index: 7)

Janne Hellsten (Citations: 12,210; h-index: 6)

Timo Aila (Citations: 52,611; h-index: 46)

S. Laine (Citations: 48,759; h-index: 41)

Abstract

Reinforcement learning (RL) has become a standard technique for post-training diffusion-based image synthesis models, as it enables learning from reward signals to explicitly improve desirable aspects such as image quality and prompt alignment. In this paper, we propose an online RL variant that reduces the variance in the model updates by sampling paired trajectories and pulling the flow velocity in the direction of the more favorable image. Unlike existing methods that treat each sampling step as a separate policy action, we consider the entire sampling process as a single action. We experiment with both high-quality vision language models and off-the-shelf quality metrics for rewards, and evaluate the outputs using a broad set of metrics. Our method converges faster and yields higher output quality and prompt alignment than previous approaches.
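The paired-trajectory idea in the abstract can be illustrated with a toy finite-difference update. The sketch below is not the paper's algorithm: it assumes a flow whose velocity is a single constant vector `theta`, a hypothetical quadratic `reward`, and pairs formed by perturbing the shared initial noise in opposite directions. All function names and parameters (`sigma`, `lr`, `batch`) are illustrative choices. Each pair is sampled end to end, and the parameters are then moved along the difference between the two outputs, weighted by the reward difference, so the update points toward the more favorable sample.

```python
import numpy as np


def reward(x, target):
    # Hypothetical stand-in for a learned reward model: higher is better.
    return -np.sum((x - target) ** 2, axis=-1)


def sample(theta, x0, steps=4):
    # Euler integration of a toy flow whose velocity is the constant vector theta.
    x = x0
    for _ in range(steps):
        x = x + theta / steps
    return x


def paired_update(theta, target, rng, batch=64, sigma=0.1, lr=0.05):
    # Draw shared initial noise, then perturb it in opposite directions
    # so that each pair of trajectories differs only by a small offset.
    x0 = rng.standard_normal((batch, theta.size))
    delta = sigma * rng.standard_normal((batch, theta.size))
    xa = sample(theta, x0 + delta)
    xb = sample(theta, x0 - delta)

    # Reward difference and output difference for every pair.
    dr = reward(xa, target) - reward(xb, target)
    d = xa - xb

    # Finite-difference estimate of the reward gradient with respect to the
    # output; the factor 4 * sigma**2 normalizes E[d d^T] to the identity.
    grad = np.mean(dr[:, None] * d, axis=0) / (4 * sigma**2)

    # For this constant-velocity toy, shifting theta shifts the output by the
    # same amount, so the estimate can be applied to theta directly.
    return theta + lr * grad
```

Repeating `paired_update` from `theta = np.zeros(2)` drives `theta` toward `target` in this toy setting. In a real text-to-image model the velocity would be a neural network conditioned on the prompt and the reward would come from a vision language model or a quality metric, neither of which is modeled here.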
