2602.03448v1 Feb 03, 2026 cs.CV

Hierarchical Concept-to-Appearance Guidance for Multi-Subject Image Generation

Zihao Wang

Citations: 124

h-index: 7

Jin Cui

Citations: 2

h-index: 1

Yijia Xu

Citations: 404

h-index: 10

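Multi-subject image generation aims to synthesize images that follow textual instructions while faithfully preserving the characteristics of multiple reference subjects. However, existing methods often rely on diffusion models that only implicitly associate text prompts with reference images, and therefore suffer from identity inconsistency and limited compositional control. This work proposes Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured guidance from high-level concepts down to fine-grained appearance. At the concept level, it introduces a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to draw on stronger semantic information from a vision-language model (VLM) and promoting consistent concept-level generation even when complete appearance information is unavailable. At the appearance level, it integrates VLM-derived correspondences into a correspondence-aware masked attention module inside the Diffusion Transformer (DiT). This module restricts each text token to attend only to its corresponding reference regions, ensuring accurate attribute binding and reliable multi-subject composition. Extensive experiments show that the proposed method achieves state-of-the-art performance in multi-subject image generation and substantially improves prompt adherence and subject consistency.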

Original Abstract

Multi-subject image generation aims to synthesize images that faithfully preserve the identities of multiple reference subjects while following textual instructions. However, existing methods often suffer from identity inconsistency and limited compositional control, as they rely on diffusion models to implicitly associate text prompts with reference images. In this work, we propose Hierarchical Concept-to-Appearance Guidance (CAG), a framework that provides explicit, structured supervision from high-level concepts to fine-grained appearances. At the conceptual level, we introduce a VAE dropout training strategy that randomly omits reference VAE features, encouraging the model to rely more on robust semantic signals from a Visual Language Model (VLM) and thereby promoting consistent concept-level generation in the absence of complete appearance cues. At the appearance level, we integrate the VLM-derived correspondences into a correspondence-aware masked attention module within the Diffusion Transformer (DiT). This module restricts each text token to attend only to its matched reference regions, ensuring precise attribute binding and reliable multi-subject composition. Extensive experiments demonstrate that our method achieves state-of-the-art performance on multi-subject image generation, substantially improving prompt following and subject consistency.
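
The two mechanisms named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names (`drop_reference_vae`, `build_mask`, `masked_attention`), the per-reference dropout granularity, the `drop_prob` parameter, and the treatment of unmatched tokens are all assumptions made for illustration, since the abstract does not specify them. The sketch also covers only text-to-reference attention, not the full DiT attention over text, latent, and reference tokens.

```python
import math
import random


def drop_reference_vae(vae_features, drop_prob, rng):
    """Randomly omit reference VAE features during training.

    Assumption: each reference is dropped independently with probability
    drop_prob; a dropped reference is represented by None.
    """
    return [None if rng.random() < drop_prob else f for f in vae_features]


def build_mask(token_subjects, region_subjects):
    """mask[i][j] is True when text token i may attend to reference region j.

    token_subjects holds a (hypothetical) subject label per text token, or
    None for tokens that match no subject. region_subjects holds the subject
    label of each reference region.
    """
    return [
        [t is not None and t == r for r in region_subjects]
        for t in token_subjects
    ]


def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention restricted by a boolean mask.

    Assumption: a token with no allowed region receives a zero vector.
    """
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    outputs = []
    for q, allowed_row in zip(queries, mask):
        if not any(allowed_row):
            outputs.append([0.0] * dim)
            continue
        # Disallowed regions get a score of -inf, so their weight is 0.
        scores = [
            scale * sum(a * b for a, b in zip(q, k)) if ok else float("-inf")
            for k, ok in zip(keys, allowed_row)
        ]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) / total
             for d in range(dim)]
        )
    return outputs


# Toy example: three text tokens, three reference regions, two subjects.
token_subjects = ["dog", None, "cat"]
region_subjects = ["dog", "dog", "cat"]
queries = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
keys = [[1.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
values = [[2.0, 0.0], [4.0, 0.0], [0.0, 6.0]]

mask = build_mask(token_subjects, region_subjects)
attended = masked_attention(queries, keys, values, mask)
```

In this toy setup the "dog" token can only mix the two dog regions and the "cat" token sees only the cat region, which is the attribute-binding behaviour the abstract describes. In a tensor framework the same effect is usually obtained by passing a boolean or additive mask to the attention operator.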

1 Citation

0 Influential

5 Altmetric

26.0 Score
