2602.04337v1 Feb 04, 2026 cs.CV

Fine-tuning Pre-trained Vision-Language Models in a Human-Annotation-Free Manner

Qian-Wei Wang (citations: 25, h-index: 3)
Yaguang Song (citations: 0, h-index: 0)
Shu-Tao Xia (citations: 12, h-index: 2)
G. MEng (citations: 52, h-index: 4)
R. Cai (citations: 5, h-index: 1)

Abstract

Large-scale vision-language models (VLMs) such as CLIP exhibit strong zero-shot generalization, but adapting them to downstream tasks typically requires costly labeled data. Existing unsupervised self-training methods rely on pseudo-labeling, yet often suffer from unreliable confidence filtering, confirmation bias, and underutilization of low-confidence samples. We propose Collaborative Fine-Tuning (CoFT), an unsupervised adaptation framework that leverages unlabeled data through a dual-model, cross-modal collaboration mechanism. CoFT introduces a dual-prompt learning strategy with positive and negative textual prompts to explicitly model pseudo-label cleanliness in a sample-dependent manner, removing the need for hand-crafted thresholds or noise assumptions. The negative prompt also regularizes lightweight visual adaptation modules, improving robustness under noisy supervision. CoFT employs a two-phase training scheme, transitioning from parameter-efficient fine-tuning on high-confidence samples to full fine-tuning guided by collaboratively filtered pseudo-labels. Building on CoFT, CoFT+ further enhances adaptation via iterative fine-tuning, momentum contrastive learning, and LLM-generated prompts. Extensive experiments demonstrate consistent gains over existing unsupervised methods and even few-shot supervised baselines.
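The abstract does not spell out how the positive and negative prompts decide whether a pseudo-label is clean, so the following is only a minimal sketch of one plausible reading, not the paper's formulation. It assumes each class has one positive and one negative prompt embedding, that the pseudo-label is the class whose positive prompt is most similar to the image embedding, and that a pseudo-label counts as clean when the image is closer to that class's positive prompt than to its negative prompt. The function names, the decision rule, and the toy 2-D embeddings are all hypothetical; a real setup would use CLIP image and text features and learned prompts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pseudo_label(image_emb, pos_prompts):
    """Zero-shot pseudo-label: index of the most similar positive prompt."""
    sims = [cosine(image_emb, p) for p in pos_prompts]
    return max(range(len(sims)), key=sims.__getitem__)


def is_clean(image_emb, label, pos_prompts, neg_prompts):
    """Assumed rule (not from the paper): the negative prompt of the
    predicted class acts as a per-sample, per-class reference, so no
    global confidence threshold is needed."""
    pos_sim = cosine(image_emb, pos_prompts[label])
    neg_sim = cosine(image_emb, neg_prompts[label])
    return pos_sim > neg_sim


def split_by_cleanliness(image_embs, pos_prompts, neg_prompts):
    """Partition samples into (index, pseudo_label) lists: clean and noisy."""
    clean, noisy = [], []
    for idx, emb in enumerate(image_embs):
        label = pseudo_label(emb, pos_prompts)
        target = clean if is_clean(emb, label, pos_prompts, neg_prompts) else noisy
        target.append((idx, label))
    return clean, noisy


# Toy example with two classes and hand-picked 2-D embeddings.
pos_prompts = [[1.0, 0.0], [0.0, 1.0]]
neg_prompts = [[0.6, 0.8], [0.8, 0.6]]
image_embs = [[1.0, 0.1], [0.55, 0.45], [0.1, 1.0]]

clean, noisy = split_by_cleanliness(image_embs, pos_prompts, neg_prompts)
print("clean:", clean)
print("noisy:", noisy)
```

In this toy data the second image sits between the two classes, so it lands in the noisy set while the other two are kept as clean. Under the two-phase scheme described in the abstract, a split like this could feed the first phase (parameter-efficient tuning on the confident samples), though how the paper actually routes samples between phases is not stated here.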
