2603.07700v1 Mar 08, 2026 cs.CV

TDM-R1: Reinforcing Few-Step Diffusion Models with Non-Differentiable Reward

Weijian Luo (Citations: 188, h-index: 7)

Jing Tang (Citations: 144, h-index: 5)

Yihong Luo (Citations: 201, h-index: 8)

Tianyang Hu (Citations: 835, h-index: 16)

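Few-step generative models substantially reduce the cost of image and video generation, but a general reinforcement learning (RL) method suited to few-step models remains an open problem. Existing RL approaches for few-step diffusion models mainly back-propagate through differentiable reward models, which excludes most important real-world reward signals, such as binary human preferences and object counts. This work proposes TDM-R1, a new RL method that properly incorporates non-differentiable rewards to improve few-step generative models. Built on Trajectory Distribution Matching (TDM), a leading few-step model, TDM-R1 separates training into surrogate reward learning and generator learning. The authors also develop a practical way to obtain per-step reward signals along TDM's deterministic generation trajectory, yielding a unified RL post-training method that significantly improves few-step models with generic rewards. Experiments cover text rendering, visual quality, and preference alignment. The results indicate that TDM-R1 is a strong RL method for few-step text-to-image models, achieving state-of-the-art RL performance on both in-domain and out-of-domain metrics. TDM-R1 also scales to the recent Z-Image model, where it outperforms both the 100-NFE and few-step variants while using only 4 function evaluations (NFEs). Project page: https://github.com/Luo-Yihong/TDM-R1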

Original Abstract

While few-step generative models have enabled powerful image and video generation at significantly lower cost, generic reinforcement learning (RL) paradigms for few-step models remain an unsolved problem. Existing RL approaches for few-step diffusion models strongly rely on back-propagating through differentiable reward models, thereby excluding the majority of important real-world reward signals, e.g., non-differentiable rewards such as humans' binary likeness, object counts, etc. To properly incorporate non-differentiable rewards to improve few-step generative models, we introduce TDM-R1, a novel reinforcement learning paradigm built upon a leading few-step model, Trajectory Distribution Matching (TDM). TDM-R1 decouples the learning process into surrogate reward learning and generator learning. Furthermore, we developed practical methods to obtain per-step reward signals along the deterministic generation trajectory of TDM, resulting in a unified RL post-training method that significantly improves few-step models' ability with generic rewards. We conduct extensive experiments ranging from text-rendering, visual quality, and preference alignment. All results demonstrate that TDM-R1 is a powerful reinforcement learning paradigm for few-step text-to-image models, achieving state-of-the-art reinforcement learning performances on both in-domain and out-of-domain metrics. Furthermore, TDM-R1 also scales effectively to the recent strong Z-Image model, consistently outperforming both its 100-NFE and few-step variants with only 4 NFEs. Project page: https://github.com/Luo-Yihong/TDM-R1
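The decoupling described in the abstract (fit a differentiable surrogate to a reward that can only be queried, then update the generator through that surrogate) can be illustrated with a deliberately tiny toy. The sketch below is not TDM-R1 and does not reproduce the paper's per-step reward construction. It assumes a one-parameter "generator" `x = theta + z`, a hypothetical indicator reward `hard_reward`, and a linear least-squares surrogate refit on every batch. All names and hyperparameters are illustrative assumptions.

```python
import random


def hard_reward(x, target=2.0):
    """Toy non-differentiable reward: 1 if x lands within 0.5 of target, else 0."""
    return 1.0 if abs(x - target) < 0.5 else 0.0


def generate(theta, noises):
    """Toy one-step 'generator': deterministic map from noise z to sample theta + z."""
    return [theta + z for z in noises]


def fit_linear_surrogate(xs, rs):
    """Least-squares fit r ~ a + b * x on one batch; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_r = sum(rs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    cov_xr = sum((x - mean_x) * (r - mean_r) for x, r in zip(xs, rs)) / n
    b = cov_xr / var_x if var_x > 0 else 0.0
    return mean_r - b * mean_x, b


def mean_reward(theta, rng, n=5000):
    """Monte Carlo estimate of the expected hard reward at a fixed theta."""
    xs = generate(theta, [rng.gauss(0.0, 1.0) for _ in range(n)])
    return sum(hard_reward(x) for x in xs) / n


def train(steps=200, batch=256, lr=1.0, seed=0):
    """Alternate surrogate fitting and generator updates; returns the final theta."""
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(steps):
        xs = generate(theta, [rng.gauss(0.0, 1.0) for _ in range(batch)])
        # The hard reward is only queried here, never differentiated.
        rs = [hard_reward(x) for x in xs]
        # Phase 1: surrogate reward learning (differentiable stand-in for the reward).
        _, slope = fit_linear_surrogate(xs, rs)
        # Phase 2: generator learning. d(surrogate)/d(theta) = slope * dx/dtheta = slope.
        theta += lr * slope
    return theta


if __name__ == "__main__":
    eval_rng = random.Random(123)
    before = mean_reward(0.0, eval_rng)
    theta = train()
    after = mean_reward(theta, eval_rng)
    print(f"theta={theta:.2f} reward before={before:.3f} after={after:.3f}")
```

The generator never needs a gradient of `hard_reward` itself; it only follows the slope of the fitted surrogate, which is the point of separating the two phases. A real few-step diffusion model would replace the scalar `theta` with network weights and the linear fit with a learned reward model.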
