2603.24965v1 Mar 26, 2026 cs.CV

Self-Corrected Image Generation with Explainable Latent Rewards

Yinyi Luo (Citations: 24, h-index: 2)
Hrishikesh Gokhale (Citations: 6, h-index: 1)
Marios Savvides (Citations: 25, h-index: 2)
Jindong Wang (Citations: 57, h-index: 2)
Shengfeng He (Citations: 90, h-index: 3)

Original Abstract

Despite significant progress in text-to-image generation, aligning outputs with complex prompts remains challenging, particularly for fine-grained semantics and spatial relations. This difficulty stems from the feed-forward nature of generation, which requires anticipating alignment without fully understanding the output. In contrast, evaluating generated images is more tractable. Motivated by this asymmetry, we propose xLARD, a self-correcting framework that uses multimodal large language models to guide generation through Explainable LAtent RewarDs. xLARD introduces a lightweight corrector that refines latent representations based on structured feedback from model-generated references. A key component is a differentiable mapping from latent edits to interpretable reward signals, enabling continuous latent-level guidance from non-differentiable image-level evaluations. This mechanism allows the model to understand, assess, and correct itself during generation. Experiments across diverse generation and editing tasks show that xLARD improves semantic alignment and visual fidelity while maintaining generative priors. Code is available at https://yinyiluo.github.io/xLARD/.
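
The abstract's central idea, turning a non-differentiable image-level judgment into continuous latent-level guidance, can be illustrated with a toy sketch. This is not the authors' implementation: the quadratic surrogate reward, the reference latent `ref`, the threshold-based `discrete_score`, and every function name below are assumptions chosen only to make the idea concrete. In xLARD the feedback comes from a multimodal large language model and the corrector is learned; here a hand-written gradient step stands in for both.

```python
from typing import List

Latent = List[float]


def surrogate_reward(z: Latent, ref: Latent) -> float:
    """Hypothetical differentiable reward: negative squared distance to a reference latent."""
    return -sum((a - b) ** 2 for a, b in zip(z, ref))


def surrogate_grad(z: Latent, ref: Latent) -> Latent:
    """Analytic gradient of surrogate_reward with respect to z."""
    return [-2.0 * (a - b) for a, b in zip(z, ref)]


def discrete_score(z: Latent, ref: Latent, tol: float = 0.1) -> int:
    """Stand-in for a non-differentiable evaluator: counts coordinates within tol of ref."""
    return sum(1 for a, b in zip(z, ref) if abs(a - b) <= tol)


def correct_latent(z: Latent, ref: Latent, steps: int = 20, lr: float = 0.1) -> Latent:
    """Toy 'corrector': gradient ascent on the surrogate reward in latent space."""
    z = list(z)  # copy so the caller's latent is left untouched
    for _ in range(steps):
        grad = surrogate_grad(z, ref)
        z = [a + lr * g for a, g in zip(z, grad)]
    return z


if __name__ == "__main__":
    initial = [1.0, -1.0, 0.5]
    ref = [0.0, 0.0, 0.0]
    corrected = correct_latent(initial, ref)
    print("discrete score before:", discrete_score(initial, ref))
    print("discrete score after: ", discrete_score(corrected, ref))
```

The discrete score gives no gradient on its own, so the correction follows the smooth surrogate instead, and the discrete score is used only to check the result afterwards.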

