2603.02557v1 Mar 03, 2026 cs.CV

CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Yutong Gao

Citations: 66

h-index: 4

Maoyuan Shao

Citations: 0

h-index: 0

Xin Huang

Citations: 15

h-index: 2

Chuang Zhu

Citations: 1

h-index: 1

Lijuan Sun

Citations: 14

h-index: 1

Guoshun Nan

Citations: 4

h-index: 1

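Vision-language models such as CLIP have made considerable progress in cross-modal representation learning, but they suffer from systematic misclassification among visually or semantically similar categories. We observe that these confusion patterns are not random; they recur between specific category pairs, exposing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a confusion-aware prompt tuning framework that lets a model learn from its own misalignment. Specifically, we build a Confusion Bank that explicitly models stable confusion relationships between categories together with misclassified samples. Building on it, we introduce a Semantic Confusion Miner (SEM), which captures global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM), which retrieves representative misclassified instances from the bank and captures sample-level cues through a Diff-Manner Adapter that integrates global and local context. To unify confusion information across granularities, we design a Multi-Granularity Difference Expert (MGDE) module that jointly uses semantic-level and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets show that the method significantly reduces confusion-induced errors while improving the discriminability and generalization of both base and novel classes, and that it resolves 50.72% of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.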

Original Abstract

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative misclassified instances from the bank and capture sample-level cues through a Diff-Manner Adapter that integrates global and local contexts. To further unify confusion information across different granularities, a Multi-Granularity Difference Expert (MGDE) module is designed to jointly leverage semantic- and sample-level experts for more robust confusion-aware reasoning. Extensive experiments on 11 benchmark datasets demonstrate that our method significantly reduces confusion-induced errors while enhancing the discriminability and generalization of both base and novel classes, successfully resolving 50.72 percent of confusable sample pairs. Code will be released at https://github.com/greatest-gourmet/CAPT.
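The abstract does not say how the Confusion Bank is populated, so the following is only a minimal sketch of one plausible reading: count which (true class, predicted class) errors recur most often and keep the indices of the misclassified samples behind each pair. The function name `build_confusion_bank`, the `top_k` parameter, and the toy labels are all hypothetical and do not come from the paper or its repository.

```python
from collections import Counter, defaultdict


def build_confusion_bank(y_true, y_pred, top_k=2):
    """Hypothetical helper: keep the top_k most frequent error pairs.

    Returns a dict mapping (true_label, predicted_label) to the error
    count and the indices of the misclassified samples for that pair.
    """
    pair_counts = Counter()
    pair_samples = defaultdict(list)
    for idx, (t, p) in enumerate(zip(y_true, y_pred)):
        if t != p:  # only misclassifications enter the bank
            pair_counts[(t, p)] += 1
            pair_samples[(t, p)].append(idx)

    bank = {}
    for pair, count in pair_counts.most_common(top_k):
        bank[pair] = {"count": count, "samples": pair_samples[pair]}
    return bank


# Toy predictions (made-up labels, for illustration only).
y_true = ["cat", "cat", "lynx", "cat", "dog", "wolf", "dog"]
y_pred = ["lynx", "lynx", "lynx", "cat", "wolf", "wolf", "dog"]

bank = build_confusion_bank(y_true, y_pred, top_k=2)
for (true_cls, pred_cls), entry in bank.items():
    print(f"{true_cls} -> {pred_cls}: {entry['count']} errors, samples {entry['samples']}")
```

In this reading, the stored pairs would feed a class-level component such as SEM, while the stored sample indices would be what a sample-level component such as SAM retrieves from; how CAPT actually wires these together is not specified in the abstract.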
