2601.04339v2 Jan 07, 2026 cs.CV

Unified Text-Image Generation with Weakness-Targeted Post-Training

Emily Dinan

Citations: 23,642

h-index: 25

Jiahui Chen

Citations: 6

h-index: 1

Philippe Hansen-Estruch

Citations: 1

h-index: 1

Xiaochuang Han

Citations: 8

h-index: 2

Yushi Hu

Citations: 331

h-index: 4

Amita Kamath

Citations: 743

h-index: 7

M. Drozdzal

Citations: 9,076

h-index: 27

Reyhane Askari Hemmat

Citations: 531

h-index: 8

Luke Zettlemoyer

Citations: 54

h-index: 4

Marjan Ghazvininejad

Citations: 19,128

h-index: 30

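Unified multimodal generation architectures that produce text and images together have recently emerged as a promising research direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, first generating reasoning text and only then producing the image. This separate, sequential inference process limits interaction between modalities and hinders automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, in which the model autonomously transitions from textual reasoning to visual synthesis within a single inference process. The authors analyze how joint text-image generation affects T2I performance and how much each modality matters during post-training. They also explore several post-training data strategies and show that a targeted dataset addressing specific limitations outperforms broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training on fully self-generated synthetic data, the approach improves multimodal image generation across four diverse T2I benchmarks, demonstrating that reward-weighting is effective for both modalities and that strategically designed post-training data matters.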

Original Abstract

Unified multimodal generation architectures that jointly produce text and images have recently emerged as a promising direction for text-to-image (T2I) synthesis. However, many existing systems rely on explicit modality switching, generating reasoning text before switching manually to image generation. This separate, sequential inference process limits cross-modal coupling and prohibits automatic multimodal generation. This work explores post-training to achieve fully unified text-image generation, where models autonomously transition from textual reasoning to visual synthesis within a single inference process. We examine the impact of joint text-image generation on T2I performance and the relative importance of each modality during post-training. We additionally explore different post-training data strategies, showing that a targeted dataset addressing specific limitations achieves superior results compared to broad image-caption corpora or benchmark-aligned data. Using offline, reward-weighted post-training with fully self-generated synthetic data, our approach enables improvements in multimodal image generation across four diverse T2I benchmarks, demonstrating the effectiveness of reward-weighting both modalities and strategically designed post-training data.
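The abstract names offline, reward-weighted post-training over both modalities but does not give the objective. As a rough illustration only, the sketch below weights each self-generated sample's text and image negative log-likelihood by a per-modality reward. The `Sample` fields, the reward range, and the averaging scheme are all assumptions made for this example, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Sample:
    """One self-generated training example (hypothetical structure)."""
    text_logprobs: Sequence[float]   # per-token log-probs of the reasoning text
    image_logprobs: Sequence[float]  # per-token log-probs of the image tokens
    text_reward: float               # assumed offline reward for the text, e.g. in [0, 1]
    image_reward: float              # assumed offline reward for the image, e.g. in [0, 1]


def mean_nll(logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood; 0.0 for an empty sequence."""
    if not logprobs:
        return 0.0
    return -sum(logprobs) / len(logprobs)


def reward_weighted_loss(batch: List[Sample]) -> float:
    """Batch-averaged NLL where each modality is scaled by its own reward.

    Rewards are precomputed (offline), so they act as fixed weights:
    high-reward samples pull the model harder than low-reward ones.
    """
    if not batch:
        return 0.0
    total = 0.0
    for s in batch:
        total += s.text_reward * mean_nll(s.text_logprobs)
        total += s.image_reward * mean_nll(s.image_logprobs)
    return total / len(batch)
```

Setting one modality's reward to a constant in this sketch would correspond to weighting only the other modality, which is one way to probe the relative importance of text and image that the abstract mentions.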

1 Citations

0 Influential

15 Altmetric

76.0 Score
