2602.15368v1 Feb 17, 2026 cs.CV

GMAIL: Generative Modality Alignment for generated Image Learning

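Generative models can now produce highly realistic images, which makes them a potentially abundant data source for training machine learning models. Despite these advantages, training on generated images as though they were real images can even lead to mode collapse, because of the modality discrepancy between the real and synthetic domains. In this paper, we propose GMAIL, a new framework that explicitly treats generated images as a modality separate from real images. Rather than indiscriminately replacing real images with generated ones in pixel space, our approach bridges the two distinct modalities in the same latent space through multi-modal learning. Specifically, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss, and then use this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach leverages recent advances in generative models and improves the effectiveness of generated image learning across a range of vision-language tasks. The framework can be easily integrated into various vision-language models, and extensive experiments demonstrate its efficacy. For example, it significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval. It also shows positive scaling trends as the amount of generated data grows, and it notably improves the captioning performance of LLaVA, a large multimodal model.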

Original Abstract

Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined GMAIL, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated into various vision-language models, and we demonstrate its efficacy through extensive experiments. For example, our framework significantly improves performance on image captioning, zero-shot image retrieval, zero-shot image classification, and long caption retrieval tasks. It also shows positive generated data scaling trends and notable enhancements in the captioning performance of the large multimodal model, LLaVA.
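
The abstract does not spell out the form of the cross-modality alignment loss, so the following is only a minimal sketch of one plausible reading: a symmetric InfoNCE-style contrastive loss that pulls each generated-image embedding toward the embedding of its paired real image in a shared latent space. The pairing scheme, the function names, the temperature value of 0.07, and the use of plain Python lists in place of a real encoder are all assumptions made for illustration, not details taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(a, b):
    """Cosine similarity between every row of a and every row of b."""
    a = [l2_normalize(v) for v in a]
    b = [l2_normalize(v) for v in b]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def alignment_loss(gen_emb, real_emb, temperature=0.07):
    """Hypothetical symmetric contrastive loss between the two modalities.

    gen_emb[i] and real_emb[i] are assumed to be embeddings of a generated
    image and a real image that depict the same content.
    """
    sim = similarity_matrix(gen_emb, real_emb)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # real-to-generated direction
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy embeddings: two real images and two generated counterparts.
real = [[1.0, 0.0], [0.0, 1.0]]
gen_aligned = [[0.9, 0.1], [0.1, 0.9]]   # each generated image near its real pair
gen_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs deliberately mismatched

print(alignment_loss(gen_aligned, real))
print(alignment_loss(gen_swapped, real))
```

On the toy data, the loss for the correctly paired embeddings is much lower than for the swapped ones, which is the behavior an alignment objective of this kind is meant to reward. In an actual fine-tuning setup the embeddings would come from a trainable image encoder and the loss would be minimized by gradient descent.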
