2603.17655v1 Mar 18, 2026 cs.CV

Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Yixiong Zou (Citations: 285, h-index: 9)

Yuhua Li (Citations: 248, h-index: 9)

Ruixuan Li (Citations: 339, h-index: 11)

Yaze Zhao (Citations: 1, h-index: 1)

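Cross-Domain Few-Shot Learning (CDFSL) adapts models trained on large-scale general data (the source domain) to downstream target domains where only a small amount of training data is available; research on vision-language models (e.g., CLIP) in this setting is still at an early stage. Typical downstream domains such as medical diagnosis need fine-grained visual cues for interpretable recognition, but the authors find that currently fine-tuned CLIP models struggle to focus on these cues, even though they can roughly focus on important regions in source domains. Existing work has already shown that CLIP does not capture subtle local patterns well. This paper additionally finds that the domain gap and the scarcity of training data make this weakness worse, and that they hurt local fine-grained patterns much more than holistic patterns. The authors call this the "local misalignment problem" in CLIP-based CDFSL. Because supervision for aligning local visual features with text semantics is lacking, they turn to self-supervision. Inspired by the translation task, they propose CC-CDFSL, a method with cycle consistency: local visual features are translated into text features and then translated back into visual features (and vice versa), and the original features are constrained to stay close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, they further propose a Semantic Anchor mechanism, which first augments the visual features to provide a larger corpus for the text-to-image mapping and then shrinks the image features to filter out irrelevant image-to-text mappings. Extensive experiments across benchmarks, backbones, and fine-tuning methods show that the approach (1) effectively improves local vision-language alignment, (2) improves the interpretability of learned patterns and model decisions through patch visualization, and (3) achieves state-of-the-art performance.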

Original Abstract

Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, where the research on vision-language models (e.g., CLIP) is still in the early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, albeit they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper, we find that the domain gap and scarce training data further exacerbate such shortcomings, much more than that of holistic patterns, which we call the local misalignment problem in CLIP-based CDFSL. To address this problem, due to the lack of supervision in aligning local visual features and text semantics, we turn to self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features close to the translated back features. To reduce the noise imported by richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
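
The cycle-consistency idea described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the abstract does not specify how features are "translated" between modalities, so the sketch assumes a soft-attention mapping (each feature is rewritten as a softmax-weighted mix of features from the other modality) and a cosine-distance penalty between the original and round-trip features. The names `translate`, `cycle_loss`, `patches`, `texts`, and the temperature `tau` are hypothetical, and the features are random stand-ins rather than CLIP outputs.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def translate(queries, bank, tau=0.1):
    """Rewrite each query as a softmax-weighted mix of `bank` vectors.

    Assumption: soft attention stands in for the paper's translation step.
    """
    out = []
    for q in queries:
        weights = softmax([dot(q, b) / tau for b in bank])
        mixed = [sum(w * b[i] for w, b in zip(weights, bank))
                 for i in range(len(q))]
        out.append(normalize(mixed))
    return out


def cycle_loss(source, target, tau=0.1):
    """Mean (1 - cosine similarity) between source features and their round trip."""
    there = translate(source, target, tau)  # source -> target modality
    back = translate(there, source, tau)    # target -> back to source modality
    return sum(1.0 - dot(s, b) for s, b in zip(source, back)) / len(source)


def random_features(n, dim, rng):
    """Random unit vectors used as placeholder features."""
    return [normalize([rng.gauss(0, 1) for _ in range(dim)]) for _ in range(n)]


rng = random.Random(0)
patches = random_features(n=6, dim=8, rng=rng)  # hypothetical local patch features
texts = random_features(n=3, dim=8, rng=rng)    # hypothetical class text features

# Both directions: visual -> text -> visual, and text -> visual -> text.
loss = cycle_loss(patches, texts) + cycle_loss(texts, patches)
print(f"cycle-consistency loss: {loss:.4f}")
```

In a real training setup this term would presumably be added to the usual few-shot classification loss and computed on features from the fine-tuned encoders; the Semantic Anchor augmentation and shrinking steps are not modeled here because the abstract does not describe them in enough detail.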

Citations: 1

Influential citations: 0

Altmetric: 5.5

Score: 28.5
