2604.11626v1 Apr 13, 2026 cs.AI

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Haozhe Wang (Citations: 153, h-index: 4)

Weiming Ren (Citations: 4,217, h-index: 12)

Jiaming Liu (Citations: 183, h-index: 5)

Wenhu Chen (Citations: 4,937, h-index: 16)

Cong Wei (Citations: 3,094, h-index: 12)

Fangzhen Lin (Citations: 515, h-index: 6)

Original Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
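The test-time Generate-Critique-Refine loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Critique` structure, the function names, the averaging of per-dimension scores, and the stopping threshold are all assumptions made for illustration, and the toy generator and critic stand in for a real image generator and a reasoning reward model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical critique format: per-dimension scores plus a revision hint."""
    scores: Dict[str, float]  # e.g. {"faithfulness": 0.8, "detail": 0.2}
    suggestion: str           # text the critic proposes adding to the prompt


def overall(critique: Critique) -> float:
    # Assumption: the overall score is the plain mean of the dimension scores.
    return sum(critique.scores.values()) / len(critique.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, Critique]:
    """Improve an output by revising the prompt only; no model weights change."""
    current_prompt = prompt
    best_output = generate(current_prompt)
    best_critique = critique(prompt, best_output)
    latest_critique = best_critique

    for _ in range(max_rounds):
        if overall(best_critique) >= target:
            break  # good enough, stop early
        # Turn the latest critique into a targeted prompt revision.
        current_prompt = refine(current_prompt, latest_critique)
        output = generate(current_prompt)
        # Always judge against the user's original request.
        latest_critique = critique(prompt, output)
        if overall(latest_critique) > overall(best_critique):
            best_output, best_critique = output, latest_critique

    return best_output, best_critique


# Toy stand-ins (hypothetical) so the loop can be run end to end.
def toy_generate(prompt: str) -> str:
    return f"image<{prompt}>"


def toy_critique(prompt: str, output: str) -> Critique:
    has_detail = "detailed" in output
    return Critique(
        scores={"faithfulness": 1.0, "detail": 1.0 if has_detail else 0.2},
        suggestion="" if has_detail else "detailed",
    )


def toy_refine(prompt: str, critique: Critique) -> str:
    return f"{prompt}, {critique.suggestion}" if critique.suggestion else prompt
```

Calling `generate_critique_refine("a red fox", toy_generate, toy_critique, toy_refine)` runs one revision round with these toy functions and then stops early. Keeping the best-scoring output rather than the latest one is a design choice of this sketch, so that a poor revision cannot make the final result worse.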

