2604.22498v1 Apr 24, 2026 cs.CV

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

Xintian Shen

Jiawei Chen

Lihao Zheng

Tao Wei

Zhou Yu

Zhenwei Shao

Yan Yang

Haochi Ma

Original Abstract

Although Multimodal Large Language Models (MLLMs) have advanced rapidly, they still face notable challenges in fine-grained multi-image understanding, often exhibiting spatial hallucination, attention leakage, and failures in object constancy. In addition, existing approaches typically rely on expensive human annotations or large-scale chain-of-thought (CoT) data generation. We propose Compositional Grounded Contrast (CGC), a low-cost, complete framework for boosting the fine-grained multi-image understanding of MLLMs. Built on existing single-image grounding annotations, CGC constructs compositional multi-image training instances through Inter-Image Contrast and Intra-Image Contrast, which introduce semantically decoupled distractor contexts for cross-image discrimination and correlated cross-view samples for object constancy, respectively. CGC further introduces a Rule-Based Spatial Reward within the GRPO framework to improve source-image attribution, spatial alignment, and structured output validity under a Think-before-Grounding paradigm. Experiments show that CGC achieves state-of-the-art results on fine-grained multi-image benchmarks, including MIG-Bench and VLM2-Bench. The learned multi-image understanding capability also transfers to broader multimodal understanding and reasoning tasks, yielding consistent gains over the Qwen3-VL-8B base model on MathVista (+2.90), MuirBench (+2.88), MMStar (+1.93), MMMU (+1.77), and BLINK (+1.69).
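
The abstract names three targets for the Rule-Based Spatial Reward (structured output validity, source-image attribution, and spatial alignment) but does not give its formula. The sketch below is one plausible way such a reward could be assembled and then turned into GRPO-style group-relative advantages. Everything specific in it is an assumption made for illustration and is not taken from the paper: the `<think>`/`<answer>` tags, the JSON answer schema with `image` and `bbox` keys, the function names, and the weights 0.2 / 0.3 / 0.5.

```python
import json
import re
from statistics import mean, pstdev

# Hypothetical Think-before-Grounding output format (an assumption, not from the paper):
#   <think>free-form reasoning</think><answer>{"image": 1, "bbox": [x1, y1, x2, y2]}</answer>
ANSWER_RE = re.compile(r"<think>.*?</think>\s*<answer>(.*?)</answer>", re.DOTALL)


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def spatial_reward(response, gt_image, gt_box,
                   w_format=0.2, w_source=0.3, w_iou=0.5):
    """Illustrative rule-based reward; the weights are assumed, not the paper's."""
    match = ANSWER_RE.fullmatch(response.strip())
    if match is None:
        return 0.0  # output is not in the expected structure
    try:
        pred = json.loads(match.group(1))
        image_idx = int(pred["image"])
        box = [float(v) for v in pred["bbox"]]
    except (ValueError, KeyError, TypeError):
        return 0.0  # answer payload is not valid under the assumed schema
    if len(box) != 4:
        return 0.0
    reward = w_format  # structured output validity
    if image_idx == gt_image:
        # Source-image attribution, plus spatial alignment measured by IoU.
        reward += w_source + w_iou * iou(box, gt_box)
    return reward


def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    good = ('<think>The mug appears in the second image.</think>'
            '<answer>{"image": 1, "bbox": [10, 10, 50, 50]}</answer>')
    wrong_image = ('<think>Probably the first image.</think>'
                   '<answer>{"image": 0, "bbox": [10, 10, 50, 50]}</answer>')
    malformed = "The mug is at [10, 10, 50, 50]."
    rewards = [spatial_reward(r, 1, (10, 10, 50, 50))
               for r in (good, wrong_image, malformed)]
    print(rewards)
    print(group_advantages(rewards))
```

In this sketch a box predicted on the wrong image earns only the format credit, so localization accuracy cannot compensate for attributing the object to the wrong source image. That gating is a design choice of the sketch, motivated by the abstract's emphasis on cross-image discrimination.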
