2601.08139v1 Jan 13, 2026 cs.CV

Subspace Alignment for Vision-Language Model Test-time Adaptation

Xuying Ning (Citations: 242, h-index: 10)
Ruizhong Qiu (Citations: 754, h-index: 15)
Xiao Lin, University of Illinois Urbana-Champaign (Citations: 241, h-index: 10)
Wenxuan Bao (Citations: 77, h-index: 6)
Hanghang Tong (Citations: 220, h-index: 8)
Jingrui He (Citations: 41, h-index: 4)
Tianxin Wei (Citations: 355, h-index: 11)
Zhichen Zeng, University of Illinois Urbana-Champaign (Citations: 587, h-index: 15)
Yuchen Yan (Citations: 264, h-index: 9)
Cheng Luo (Citations: 128, h-index: 3)
M. Cheng (Citations: 58, h-index: 5)

Original Abstract

Vision-language models (VLMs), despite their extraordinary zero-shot capabilities, are vulnerable to distribution shifts. Test-time adaptation (TTA) emerges as a predominant strategy to adapt VLMs to unlabeled test data on the fly. However, existing TTA methods heavily rely on zero-shot predictions as pseudo-labels for self-training, which can be unreliable under distribution shifts and misguide adaptation due to two fundamental limitations. First (Modality Gap), distribution shifts induce gaps between visual and textual modalities, making cross-modal relations inaccurate. Second (Visual Nuisance), visual embeddings encode rich but task-irrelevant noise that often overwhelms task-specific semantics under distribution shifts. To address these limitations, we propose SubTTA, which aligns the semantic subspaces of both modalities to enhance zero-shot predictions to better guide the TTA process. To bridge the modality gap, SubTTA extracts the principal subspaces of both modalities and aligns the visual manifold to the textual semantic anchor by minimizing their chordal distance. To eliminate visual nuisance, SubTTA projects the aligned visual features onto the task-specific textual subspace, which filters out task-irrelevant noise by constraining visual embeddings within the valid semantic span, and standard TTA is further performed on the purified space to refine the decision boundaries. Extensive experiments on various benchmarks and VLM architectures demonstrate the effectiveness of SubTTA, yielding an average improvement of 2.24% over state-of-the-art TTA methods.
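
The abstract names three linear-algebra ingredients: principal subspaces of each modality, the chordal distance between them, and projection of visual features onto the textual subspace. The NumPy sketch below illustrates those ingredients on random toy data. It is not the authors' implementation: the function names, the uncentered SVD, the subspace rank `k`, and the toy dimensions are all assumptions made for illustration. In particular, the abstract says SubTTA aligns the subspaces by minimizing their chordal distance but does not say how, so the sketch swaps in a closed-form orthogonal Procrustes rotation as a stand-in for that step.

```python
import numpy as np


def principal_subspace(X, k):
    """Return a (d, k) orthonormal basis for the top-k right singular
    directions of X, where X has shape (n, d). Uncentered SVD is an
    assumption of this sketch."""
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T


def chordal_distance(U, V):
    """Chordal distance between span(U) and span(V) for (d, k) orthonormal
    bases: sqrt(k - ||U^T V||_F^2), i.e. sqrt of the summed squared sines
    of the principal angles."""
    k = U.shape[1]
    overlap = np.linalg.norm(U.T @ V, "fro") ** 2
    return float(np.sqrt(max(k - overlap, 0.0)))


def procrustes_rotation(U_src, U_tgt):
    """Orthogonal (d, d) matrix Q minimizing ||Q U_src - U_tgt||_F.
    A stand-in for the alignment step, not the paper's optimizer."""
    A, _, Bt = np.linalg.svd(U_tgt @ U_src.T)
    return A @ Bt


def project_onto(X, U):
    """Orthogonally project the rows of X onto span(U)."""
    return X @ U @ U.T


# Toy data standing in for embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
d, k = 32, 5
text = rng.normal(size=(10, d))      # one row per class prompt
visual = rng.normal(size=(200, d))   # one row per test image

U_t = principal_subspace(text, k)
U_v = principal_subspace(visual, k)
before = chordal_distance(U_v, U_t)

# Rotate visual features so their principal subspace matches the text one.
Q = procrustes_rotation(U_v, U_t)
aligned = visual @ Q.T
after = chordal_distance(Q @ U_v, U_t)

# Keep only the components that lie inside the textual subspace.
purified = project_onto(aligned, U_t)

# Zero-shot style scores: similarity of purified features to class prompts.
scores = purified @ text.T
print(f"chordal distance before: {before:.3f}, after: {after:.2e}")
```

Projection is idempotent, so applying `project_onto` to `purified` a second time leaves it unchanged; that is a quick sanity check that the features really are confined to the textual subspace. Any further self-training on the purified features, as the abstract describes, is outside the scope of this sketch.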
