2601.13622v1 Jan 20, 2026 cs.CV

CARPE: Context-Aware Image Representation Prioritization via Ensemble for Large Vision-Language Models

Zhe Zhao

Donghee Lee

Rui Cai

University of California, Davis

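Recent advances in large vision-language models (LVLMs) have helped make them general-purpose assistants. However, LVLMs still struggle with vision-centric tasks such as image classification and often underperform their base vision encoders, such as CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE). CARPE is a new model-agnostic framework that introduces vision-integration layers and a context-aware ensemble strategy to decide whether to prioritize image representations or to rely on the reasoning capabilities of the language model. This design lets the model adaptively weight the visual and textual modalities and capture various aspects of image representations, yielding consistent improvements across classification and vision-language benchmarks. Extensive experiments confirm that CARPE improves performance on image classification benchmarks and also achieves better results on a variety of vision-language benchmarks. In addition, CARPE is designed to integrate effectively with most open-source LVLMs composed of a vision encoder and a language model, ensuring adaptability across diverse architectures.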

Original Abstract

Recent advancements in Large Vision-Language Models (LVLMs) have pushed them closer to becoming general-purpose assistants. Despite their strong performance, LVLMs still struggle with vision-centric tasks such as image classification, underperforming compared to their base vision encoders, which are often CLIP-based models. To address this limitation, we propose Context-Aware Image Representation Prioritization via Ensemble (CARPE), a novel, model-agnostic framework which introduces vision-integration layers and a context-aware ensemble strategy to identify when to prioritize image representations or rely on the reasoning capabilities of the language model. This design enhances the model's ability to adaptively weight visual and textual modalities and enables the model to capture various aspects of image representations, leading to consistent improvements in generalization across classification and vision-language benchmarks. Extensive experiments demonstrate that CARPE not only improves performance on image classification benchmarks but also enhances results across various vision-language benchmarks. Finally, CARPE is designed to be effectively integrated with most open-source LVLMs that consist of a vision encoder and a language model, ensuring its adaptability across diverse architectures.
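
The abstract does not say how the context-aware ensemble is computed. As a purely illustrative sketch, assume it is a convex combination of two class distributions, one from the vision encoder and one from the language model, weighted by a context-dependent gate. The names `ensemble` and `gate_score`, and the sigmoid gate itself, are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a real-valued score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def ensemble(vision_logits, lm_logits, gate_score):
    """Blend two class distributions with a context-dependent weight.

    gate_score is a hypothetical scalar (assumed to come from some
    context-aware module): larger values put more weight on the
    image representation, smaller values on the language model.
    """
    g = sigmoid(gate_score)
    p_vision = softmax(vision_logits)
    p_lm = softmax(lm_logits)
    return [g * pv + (1.0 - g) * pl for pv, pl in zip(p_vision, p_lm)]


# Toy inputs: the two branches disagree on the most likely class.
vision_logits = [2.0, 0.5, 0.1]
lm_logits = [0.1, 0.3, 2.5]

vision_heavy = ensemble(vision_logits, lm_logits, gate_score=6.0)
lm_heavy = ensemble(vision_logits, lm_logits, gate_score=-6.0)
```

Under this assumption, a high gate score makes the blended prediction follow the vision encoder, and a low one makes it follow the language model. Because both inputs are probability distributions and the weights sum to one, the output is also a valid distribution.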
