2603.22228v1 Mar 23, 2026 cs.CV

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation

Zhibin Wang (Citations: 227, h-index: 8)

Jun Song (Citations: 29, h-index: 3)

Sashuai Zhou (Citations: 49, h-index: 4)

Junpeng Ma (Citations: 8, h-index: 2)

Chengjun Yu (Citations: 41, h-index: 3)

Qiang Zhou (Citations: 148, h-index: 7)

Yue Cao (Citations: 1,811, h-index: 5)

Ruofan Hu (Citations: 88, h-index: 4)

Ziang Zhang (Citations: 548, h-index: 11)

Xiaoda Yang (Citations: 447, h-index: 9)

Bo Zheng (Citations: 115, h-index: 5)

Zhou Zhao (Citations: 94, h-index: 5)

Original Abstract

Recent advances in text-to-image (T2I) generation via reinforcement learning (RL) have benefited from reward models that assess semantic alignment and visual quality. However, most existing reward models pay limited attention to fine-grained spatial relationships, often producing images that appear plausible overall yet contain inaccuracies in object positioning. In this work, we present SpatialReward, a verifiable reward model explicitly designed to evaluate spatial layouts in generated images. SpatialReward adopts a multi-stage pipeline: a Prompt Decomposer extracts entities, attributes, and spatial metadata from free-form prompts; expert detectors provide accurate visual grounding of object positions and attributes; and a vision-language model applies chain-of-thought reasoning over grounded observations to assess complex spatial relations that are challenging for rule-based methods. To more comprehensively evaluate spatial relationships in generated images, we introduce SpatRelBench, a benchmark covering object attributes, orientation, inter-object relations, and rendered text placement. Experiments on Stable Diffusion and FLUX show that incorporating SpatialReward into RL training consistently improves spatial consistency and overall generation quality, with results aligned more closely to human judgments. These findings indicate that verifiable reward models hold considerable potential for enabling more accurate and controllable optimization in text-to-image generation models.
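
To make the idea of a verifiable spatial reward concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that a prompt decomposer has already produced (subject, relation, reference) triples and that a detector has returned one bounding box per object; the names `Box`, `relation_holds`, and `spatial_reward` are hypothetical. It covers only the simple rule-based case. The abstract states that the paper relies on a vision-language model with chain-of-thought reasoning for relations that rules like these handle poorly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates, origin at the top-left corner."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def center(self):
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)


def relation_holds(subject: Box, relation: str, reference: Box) -> bool:
    """Check one spatial relation by comparing box centers (hypothetical rule set)."""
    sx, sy = subject.center
    rx, ry = reference.center
    if relation == "left of":
        return sx < rx
    if relation == "right of":
        return sx > rx
    if relation == "above":
        return sy < ry  # smaller y is higher in image coordinates
    if relation == "below":
        return sy > ry
    raise ValueError(f"unsupported relation: {relation}")


def spatial_reward(constraints, detections) -> float:
    """Return the fraction of (subject, relation, reference) triples satisfied.

    A triple whose subject or reference was not detected counts as unsatisfied.
    """
    if not constraints:
        return 0.0
    satisfied = 0
    for subject, relation, reference in constraints:
        if subject in detections and reference in detections:
            if relation_holds(detections[subject], relation, detections[reference]):
                satisfied += 1
    return satisfied / len(constraints)


# Assumed decomposer output for "a cat to the left of a dog, a lamp above a table".
constraints = [("cat", "left of", "dog"), ("lamp", "above", "table")]

# Assumed detector output: the lamp was rendered lower than the table.
detections = {
    "cat": Box(10, 50, 60, 100),
    "dog": Box(120, 40, 200, 110),
    "lamp": Box(80, 120, 100, 160),
    "table": Box(60, 100, 140, 150),
}

reward = spatial_reward(constraints, detections)
```

Because the score is computed from explicit detections and explicit rules, each unsatisfied constraint can be traced back to a specific object pair, which is the sense in which such a reward is checkable rather than an opaque scalar.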

5 Citations

0 Influential

5.5 Altmetric

32.5 Score
