2601.20419v1 Jan 28, 2026 cs.CV

BiFTA: A Bi-refinement Method for Fine-grained Text-visual Alignment

Let's Roll a BiFTA: Bi-refinement for Fine-grained Text-visual Alignment in Vision-Language Models

Yuhao Sun (Citations: 7, h-index: 2)

C. Cai (Citations: 49, h-index: 4)

Jiacheng Zhang (Citations: 50, h-index: 2)

Zesheng Ye (Citations: 47, h-index: 3)

Xin Yuan (Citations: 40, h-index: 4)

Feng Liu (Citations: 109, h-index: 1)

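Recent research has shown that aligning fine-grained text descriptions with specific image regions can substantially improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, both fine-grained text descriptions and image regions often contain redundant information, which can undermine the effectiveness of text-visual alignment. This paper addresses the problem from two perspectives, view refinement and description refinement, in a method named Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image regions with high Intersection over Union (IoU) ratios to obtain more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity to keep the remaining descriptions diverse. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP models, demonstrating the importance of removing redundant information in visual-text alignment.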

Original Abstract

Recent research has shown that aligning fine-grained text descriptions with localized image patches can significantly improve the zero-shot performance of pre-trained vision-language models (e.g., CLIP). However, we find that both fine-grained text descriptions and localized image patches often contain redundant information, making text-visual alignment less effective. In this paper, we tackle this issue from two perspectives: View refinement and Description refinement, termed as Bi-refinement for Fine-grained Text-visual Alignment (BiFTA). View refinement removes redundant image patches with high Intersection over Union (IoU) ratios, resulting in more distinctive visual samples. Description refinement removes redundant text descriptions with high pairwise cosine similarity, ensuring greater diversity in the remaining descriptions. BiFTA achieves superior zero-shot performance on 6 benchmark datasets for both ViT-based and ResNet-based CLIP, justifying the necessity to remove redundant information in visual-text alignment.
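
The two refinement steps described in the abstract can be illustrated with a minimal sketch. The abstract does not state how redundant items are selected, so everything here is an assumption rather than the authors' procedure: the greedy keep-first strategy, the box format `(x1, y1, x2, y2)`, the threshold values `0.5` and `0.9`, and the function names `refine_views` and `refine_descriptions` are all hypothetical. Text embeddings are assumed to have been computed beforehand (for example by a CLIP text encoder) and are passed in as plain lists of floats.

```python
import math


def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def greedy_filter(items, similarity, threshold):
    """Return indices of items kept by a greedy pass (assumed strategy).

    An item is kept only if its similarity to every previously kept
    item does not exceed the threshold, so the result depends on order.
    """
    kept = []
    for i, item in enumerate(items):
        if all(similarity(item, items[j]) <= threshold for j in kept):
            kept.append(i)
    return kept


def refine_views(boxes, max_iou=0.5):
    """Drop image patches that overlap a kept patch too much (hypothetical threshold)."""
    return greedy_filter(boxes, iou, max_iou)


def refine_descriptions(embeddings, max_cos=0.9):
    """Drop descriptions too similar to a kept description (hypothetical threshold)."""
    return greedy_filter(embeddings, cosine, max_cos)


# Toy inputs: the second box and the second embedding are near-duplicates of the first.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
embeddings = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0)]
kept_views = refine_views(boxes)
kept_descriptions = refine_descriptions(embeddings)
```

In this toy input the second patch overlaps the first with an IoU of 81/119, about 0.68, and the second embedding has a cosine similarity of about 0.995 with the first, so both are dropped and indices 0 and 2 remain in each case. A greedy pass is order-dependent; ranking items by some relevance score before filtering would be one possible variation, but the abstract does not say whether the method does this.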

