2602.02977v1 Feb 03, 2026 cs.CV

Aligning Forest and Trees in Images and Long Captions for Visually Grounded Understanding

Byeongju Woo

Byeonghyun Pak

Sangwoo Mo

Stella X. Yu

Zilin Wang

Abstract

Large vision-language models such as CLIP struggle with long captions because they align images and texts as undifferentiated wholes. Fine-grained vision-language understanding requires hierarchical semantics capturing both global context and localized details across visual and textual domains. Yet linguistic hierarchies from syntax or semantics rarely match visual organization, and purely visual hierarchies tend to fragment scenes into appearance-driven parts without semantic focus. We propose CAFT (Cross-domain Alignment of Forests and Trees), a hierarchical image-text representation learning framework that aligns global and local semantics across images and long captions without pixel-level supervision. Coupling a fine-to-coarse visual encoder with a hierarchical text transformer, it uses a hierarchical alignment loss that matches whole images with whole captions while biasing region-sentence correspondences, so that coarse semantics are built from fine-grained evidence rather than from aggregation untethered to part-level grounding. Trained on 30M image-text pairs, CAFT achieves state-of-the-art performance on six long-text retrieval benchmarks and exhibits strong scaling behavior. Experiments show that hierarchical cross-domain alignment enables fine-grained, visually grounded image-text representations to emerge without explicit region-level supervision.
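
To make the idea of a hierarchical alignment loss concrete, here is a minimal NumPy sketch of one possible form: a global image-caption contrastive term combined with a local region-sentence term. This is not the authors' formulation, which the abstract does not specify. The mean pooling of regions and sentences into global embeddings, the max-over-regions matching rule, the function names, and the `local_weight` and `temperature` parameters are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(sim, temperature=0.07):
    # Symmetric contrastive loss over a (B, B) similarity matrix;
    # matched image-caption pairs sit on the diagonal.
    logits = sim / temperature
    idx = np.arange(sim.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def hierarchical_alignment_loss(regions, sentences,
                                local_weight=0.5, temperature=0.07):
    # regions:   (B, R, D) region embeddings from a visual encoder (assumed given)
    # sentences: (B, S, D) sentence embeddings from a text encoder (assumed given)
    regions = l2_normalize(regions)
    sentences = l2_normalize(sentences)

    # Coarse level: global embeddings are pooled from the fine units
    # (hypothetical choice: mean pooling), so the whole is built from parts.
    image_global = l2_normalize(regions.mean(axis=1))
    caption_global = l2_normalize(sentences.mean(axis=1))
    global_sim = image_global @ caption_global.T                 # (B, B)

    # Fine level: for every image-caption pair, match each sentence to its
    # most similar region, then average over sentences (hypothetical rule).
    pairwise = np.einsum("ird,jsd->ijsr", regions, sentences)    # (B, B, S, R)
    local_sim = pairwise.max(axis=3).mean(axis=2)                # (B, B)

    return (info_nce(global_sim, temperature)
            + local_weight * info_nce(local_sim, temperature))
```

With random embeddings, passing the same array as both `regions` and `sentences` (a perfectly aligned batch) should give a lower loss than passing a batch whose captions are reversed relative to the images. This gives a quick sanity check that both terms reward correct pairings.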
