2602.04304v1 Feb 04, 2026 cs.CV

Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement

Qinglin Zhu

Citations: 18

h-index: 2

Yulan He

Citations: 264

h-index: 11

Lin Gui

Citations: 282

h-index: 11

Zipeng Zhu

Citations: 5

h-index: 2

Zhanghao Hu

Citations: 131

h-index: 2

Yuxi Hong

Citations: 2

h-index: 1

Yijun Liu

Citations: 53

h-index: 5

Jingyong Su

Citations: 68

h-index: 3

Abstract

Large Vision-Language Models (LVLMs) have advanced rapidly by aligning visual patches with the text embedding space, but a fixed visual-token budget forces images to be resized to a uniform pretraining resolution, often erasing fine-grained details and causing hallucinations via over-reliance on language priors. Recent attention-guided enhancement (e.g., cropping or region-focused attention allocation) alleviates this, yet it commonly hinges on a static "magic layer" empirically chosen on simple recognition benchmarks and thus may not transfer to complex reasoning tasks. In contrast to this static assumption, we propose a dynamic perspective on visual grounding. Through a layer-wise sensitivity analysis, we demonstrate that visual grounding is a dynamic process: while simple object recognition tasks rely on middle layers, complex visual search and reasoning tasks require visual information to be reactivated at deeper layers. Based on this observation, we introduce Visual Activation by Query (VAQ), a metric that identifies the layer whose attention map is most relevant to query-specific visual grounding by measuring attention sensitivity to the input query. Building on VAQ, we further propose LASER (Layer-adaptive Attention-guided Selective visual and decoding Enhancement for Reasoning), a training-free inference procedure that adaptively selects task-appropriate layers for visual localization and question answering. Experiments across diverse VQA benchmarks show that LASER significantly improves VQA accuracy across tasks with varying levels of complexity.
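The abstract describes VAQ only at a high level (attention sensitivity to the input query, used to pick a layer) and does not give its formula. The following is therefore a minimal sketch under stated assumptions, not the authors' implementation: it assumes sensitivity can be approximated by the KL divergence between a layer's image-token attention with the real question and with a generic placeholder prompt, and it assumes localization reduces to a bounding box around the most-attended patches. All function names, the `keep` threshold, and the toy attention values are hypothetical.

```python
from math import log
from typing import List, Sequence, Tuple


def normalize(weights: Sequence[float]) -> List[float]:
    """Scale non-negative attention weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("attention weights must have a positive sum")
    return [w / total for w in weights]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q), with a small epsilon to avoid log(0)."""
    return sum(pi * log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def query_sensitivity(
    attn_with_query: Sequence[Sequence[float]],
    attn_baseline: Sequence[Sequence[float]],
) -> List[float]:
    """One score per layer: how far image-token attention shifts when the
    real question replaces a generic prompt (ASSUMED proxy for VAQ)."""
    return [
        kl_divergence(normalize(with_q), normalize(base))
        for with_q, base in zip(attn_with_query, attn_baseline)
    ]


def select_layer(scores: Sequence[float]) -> int:
    """Index of the layer with the highest sensitivity score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def crop_box(
    attn: Sequence[float], grid_width: int, keep: float = 0.5
) -> Tuple[int, int, int, int]:
    """Bounding box (top, left, bottom, right), in patch coordinates, of all
    patches whose attention is at least `keep` times the maximum.
    The thresholding rule is an illustrative assumption."""
    threshold = keep * max(attn)
    rows, cols = [], []
    for index, weight in enumerate(attn):
        if weight >= threshold:
            rows.append(index // grid_width)
            cols.append(index % grid_width)
    return min(rows), min(cols), max(rows), max(cols)


# Toy data: 3 layers, a 2x2 patch grid flattened row by row.
baseline = [[0.25, 0.25, 0.25, 0.25] for _ in range(3)]
with_query = [
    [0.25, 0.25, 0.25, 0.25],  # layer 0: the question changes nothing
    [0.40, 0.30, 0.20, 0.10],  # layer 1: mild shift
    [0.05, 0.05, 0.10, 0.80],  # layer 2: sharp focus on one patch
]

scores = query_sensitivity(with_query, baseline)
best_layer = select_layer(scores)
box = crop_box(with_query[best_layer], grid_width=2)
```

In this toy data the deepest layer reacts most strongly to the question, so it is the one selected, and the resulting box covers only the bottom-right patch. A static "magic layer" approach would instead fix the layer index in advance, regardless of the query.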

0 Citations

0 Influential

5.5 Altmetric

27.5 Score
