2605.05593v1 May 07, 2026 cs.AI

Causal Probing for Internal Visual Representations in Multimodal Large Language Models

Huijia Zhu (citations: 78, h-index: 6)
Weiqiang Wang (citations: 226, h-index: 8)
Zhuosheng Zhang (citations: 67, h-index: 3)
Tianjie Ju (citations: 6, h-index: 1)
Lian He (citations: 6, h-index: 2)
Jun Lan (citations: 555, h-index: 9)
Zheng Wu (citations: 99, h-index: 7)
Zehao Deng (citations: 6, h-index: 2)

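Although multimodal large language models (MLLMs) have achieved remarkable success across a wide range of tasks, the internal mechanisms by which they encode and connect different visual concepts remain poorly understood. To close this gap, we propose a causal framework based on activation steering that actively probes and manipulates internal visual representations. Systematic interventions across four categories of visual concepts reveal a difference in how concepts are encoded: concrete entities are stored in concentrated, specific locations, whereas abstract concepts are distributed throughout the network. Importantly, this difference sheds light on the mechanism behind scaling: increasing model depth is essential for encoding complex, abstract concepts, while entity localization is largely unaffected by scale. In addition, reverse steering shows that blocking explicit output increases latent activations, which points to a compensatory mechanism between perception and generation. Finally, extending the analysis to visual reasoning reveals a disconnect between perception and reasoning: MLLMs successfully recognize geometric relations but treat them merely as static visual features and fail to trigger the procedural execution needed for abstract problem solving.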

Original Abstract

Despite the remarkable success of Multimodal Large Language Models (MLLMs) across diverse tasks, the internal mechanisms governing how they encode and ground distinct visual concepts remain poorly understood. To bridge this gap, we propose a causal framework based on activation steering to actively probe and manipulate internal visual representations. Through systematic intervention across four visual concept categories, our results reveal a divergence in concept encoding: entities exhibit distinct localized memorization, whereas abstract concepts are globally distributed across the network. Critically, this divergence uncovers a mechanistic driver of scaling laws: increasing model depth is indispensable for encoding distributed and complex abstract concepts, whereas entity localization remains remarkably invariant to scale. Furthermore, reverse steering uncovers that blocking explicit output triggers a surge in latent activations, exposing a compensatory mechanism between perception and generation. Finally, extending our analysis to visual reasoning, we expose a disconnect between perception and reasoning: although MLLMs successfully recognize geometric relations, they treat them merely as static visual features, failing to trigger the procedural execution necessary for abstract problem-solving.
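
For readers unfamiliar with the term, the following is a minimal sketch of one common form of activation steering: a difference-of-means direction added to, or subtracted from, a hidden state. The abstract does not say how the authors compute or apply their steering directions, so everything below is an assumption made for illustration. That includes the function names `steering_vector`, `steer`, and `projection`, the toy three-dimensional activations, and the choice of `alpha`.

```python
# Generic difference-of-means activation steering on toy vectors.
# All names and values are hypothetical; this is NOT the paper's implementation.
from math import sqrt


def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(with_concept, without_concept):
    """Unit-norm direction pointing from 'concept absent' to 'concept present'."""
    mu_pos = mean_vector(with_concept)
    mu_neg = mean_vector(without_concept)
    diff = [p - q for p, q in zip(mu_pos, mu_neg)]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("the two activation sets have identical means")
    return [d / norm for d in diff]


def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction; alpha < 0 steers away from it."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def projection(hidden, direction):
    """Scalar read-out of how strongly the hidden state expresses the direction."""
    return sum(h * d for h, d in zip(hidden, direction))


# Toy activations at a single layer (hidden size 3); the values are made up.
with_concept = [[2.0, 0.0, 1.0], [4.0, 0.0, 1.0]]
without_concept = [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]

v = steering_vector(with_concept, without_concept)
h = [1.0, 5.0, 1.0]
h_plus = steer(h, v, alpha=2.0)    # push the state toward the concept
h_minus = steer(h, v, alpha=-2.0)  # reverse direction: push it away
```

In a real model the same shift would typically be applied to a chosen layer's hidden states during the forward pass, and the causal effect would be read off the change in the model's output. Which layers and concepts respond to such shifts is the kind of question the abstract describes.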

