2603.06302v1 Mar 06, 2026 cs.CV

DEX-AR: A Dynamic Explainability Method for Autoregressive Vision-Language Models

Hildegard Kuehne

Citations: 18

h-index: 2

Walid Bousselham

Citations: 140

h-index: 3

Angie Boggust

MIT

Citations: 613

h-index: 9

Hendrik Strobelt

Citations: 34

h-index: 2

Abstract

As Vision-Language Models (VLMs) become increasingly sophisticated and widely used, understanding their decision-making process becomes increasingly important. Traditional explainability methods, designed for classification tasks, struggle with modern autoregressive VLMs because of their token-by-token generation process and the intricate interactions between visual and textual modalities. We present DEX-AR (Dynamic Explainability for AutoRegressive models), a novel explainability method that addresses these challenges by generating both per-token and sequence-level 2D heatmaps highlighting the image regions crucial for the model's textual responses. The method interprets autoregressive VLMs, including the varying importance of layers and generated tokens, by computing layer-wise gradients with respect to attention maps during token-by-token generation. DEX-AR introduces two key innovations: a dynamic head filtering mechanism that identifies attention heads focused on visual information, and a sequence-level filtering approach that aggregates per-token explanations while distinguishing between visually grounded and purely linguistic tokens. Our evaluation on ImageNet, VQAv2, and PascalVOC shows consistent improvement in both perturbation-based metrics, which use a novel normalized perplexity measure, and segmentation-based metrics.
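
The abstract names the ingredients (layer-wise gradients with respect to attention maps, head filtering, sequence-level aggregation) but not the exact formulas, so the following is only a rough NumPy sketch of how such pieces could fit together, not the authors' implementation. Everything specific here is an assumption made for illustration: the function names `token_heatmap` and `sequence_heatmap`, the use of positive gradient-times-attention as relevance, the head weight defined as the share of attention mass on image tokens, and the per-token score used for aggregation. The attention and gradient arrays are random placeholders standing in for values that would come from a real VLM.

```python
import numpy as np


def token_heatmap(attn, grad, image_slice, grid_hw):
    """Heatmap for ONE generated token (illustrative, not the paper's formula).

    attn, grad: arrays of shape (layers, heads, seq_len) holding the attention
    row of the generated token and the gradient of its logit w.r.t. that row.
    image_slice: slice selecting the image-patch positions in the sequence.
    grid_hw: (height, width) of the image-patch grid.
    """
    # Assumed relevance: gradient-weighted attention, positive part only.
    relevance = np.maximum(attn * grad, 0.0)
    # Assumed head filter: fraction of each head's attention on image tokens.
    visual_mass = attn[..., image_slice].sum(axis=-1)        # (layers, heads)
    head_weight = visual_mass / (attn.sum(axis=-1) + 1e-12)  # (layers, heads)
    # Weight heads, then sum over layers and heads -> one value per patch.
    per_patch = (head_weight[..., None] * relevance[..., image_slice]).sum(axis=(0, 1))
    return per_patch.reshape(grid_hw)


def sequence_heatmap(token_maps, token_scores):
    """Aggregate per-token maps (T, H, W) using nonnegative scores (T,).

    A score near zero (e.g. for a purely linguistic token) suppresses
    that token's map in the sequence-level result.
    """
    weights = token_scores / (token_scores.sum() + 1e-12)
    return np.tensordot(weights, token_maps, axes=1)         # (H, W)


# Placeholder data: 2 layers, 3 heads, 4 image patches + 6 text tokens.
rng = np.random.default_rng(0)
maps = []
for _ in range(5):  # five generated tokens
    attn = rng.random((2, 3, 10))
    attn /= attn.sum(axis=-1, keepdims=True)   # rows sum to 1, like softmax output
    grad = rng.standard_normal((2, 3, 10))
    maps.append(token_heatmap(attn, grad, slice(0, 4), (2, 2)))
maps = np.stack(maps)
# Assumed visual-grounding score: total relevance mass of each token's map.
scores = maps.sum(axis=(1, 2))
print(sequence_heatmap(maps, scores))
```

In a real setting the `attn` and `grad` arrays would be collected at each decoding step, for example with an automatic-differentiation framework, and the resulting map would be upsampled to the input image resolution for display.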
