2603.18523v1 Mar 19, 2026 cs.CV

Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models

Zhiyu Xue (Citations: 27, h-index: 2)

Liwei Che (Citations: 54, h-index: 3)

Yihao Quan (Citations: 121, h-index: 4)

Benlin Liu (Citations: 1,952, h-index: 9)

Zeru Shi (Citations: 20, h-index: 2)

M. Hurst (Citations: 318, h-index: 9)

Jacob Feldman (Citations: 4, h-index: 1)

Ruixiang Tang (Citations: 115, h-index: 7)

Ranjay Krishna (Citations: 224, h-index: 7)

Vladimir Pavlovic (Citations: 2, h-index: 1)

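Counting is a simple yet powerful way to evaluate the reasoning ability of Large Vision-Language Models (LVLMs): it requires the model to identify each individual object and then add them all up. In this study, we combine controlled synthetic and real-world benchmarks with mechanistic analyses to investigate how LVLMs implement counting. The results show that LVLMs exhibit human-like counting behavior, performing precisely on small numerosities but falling back on noisy estimation for larger quantities. We introduce two new interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that uses simple, abundantly available synthetic images to fine-tune pretrained LVLMs exclusively on counting. Although this fine-tuning is narrow in scope, it improves counting accuracy on in-distribution synthetic data and, for Qwen2.5-VL, also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks. These results highlight the central, influential role of counting in visual reasoning and suggest that targeted improvement of counting mechanisms is a potential way to improve overall visual reasoning ability.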

Original Abstract

Counting serves as a simple but powerful test of a Large Vision-Language Model's (LVLM's) reasoning; it forces the model to identify each individual object and then add them all up. In this study, we investigate how LVLMs implement counting using controlled synthetic and real-world benchmarks, combined with mechanistic analyses. Our results show that LVLMs display a human-like counting behavior, with precise performance on small numerosities and noisy estimation for larger quantities. We introduce two novel interpretability methods, Visual Activation Patching and HeadLens, and use them to uncover a structured "counting circuit" that is largely shared across a variety of visual reasoning tasks. Building on these insights, we propose a lightweight intervention strategy that exploits simple and abundantly available synthetic images to fine-tune arbitrary pretrained LVLMs exclusively on counting. Despite the narrow scope of this fine-tuning, the intervention not only enhances counting accuracy on in-distribution synthetic data, but also yields an average improvement of +8.36% on out-of-distribution counting benchmarks and an average gain of +1.54% on complex, general visual reasoning tasks for Qwen2.5-VL. These findings highlight the central, influential role of counting in visual reasoning and suggest a potential pathway for improving overall visual reasoning capabilities through targeted enhancement of counting mechanisms.
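The abstract mentions fine-tuning on "simple and abundantly available synthetic images" but does not say how those images are built. As a rough illustration only, the sketch below generates labeled counting examples by placing non-overlapping circles on a canvas with rejection sampling. Everything in it (the `Circle` type, `make_counting_scene`, `make_example`, the canvas size, radius, and question wording) is a hypothetical choice for this example and not the authors' pipeline; rendering the circles to pixels is left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Circle:
    x: float
    y: float
    r: float


def overlaps(a: Circle, b: Circle, margin: float = 0.0) -> bool:
    """True if the centers are closer than the sum of the radii plus margin."""
    min_dist = a.r + b.r + margin
    return (a.x - b.x) ** 2 + (a.y - b.y) ** 2 < min_dist ** 2


def make_counting_scene(n: int, rng: random.Random, size: float = 224.0,
                        radius: float = 8.0, max_tries: int = 10_000) -> list[Circle]:
    """Place n non-overlapping circles on a size x size canvas (rejection sampling)."""
    circles: list[Circle] = []
    tries = 0
    while len(circles) < n:
        tries += 1
        if tries > max_tries:
            raise RuntimeError(f"could not place {n} circles; canvas too crowded")
        # Sample a center so that the whole circle stays inside the canvas.
        candidate = Circle(rng.uniform(radius, size - radius),
                           rng.uniform(radius, size - radius), radius)
        # Keep the candidate only if it leaves a small gap to every placed circle.
        if not any(overlaps(candidate, other, margin=2.0) for other in circles):
            circles.append(candidate)
    return circles


def make_example(n: int, rng: random.Random) -> dict:
    """Pair a scene with a counting question and its ground-truth answer (assumed format)."""
    return {
        "circles": make_counting_scene(n, rng),
        "question": "How many circles are in the image?",
        "answer": str(n),
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the dataset is reproducible
    dataset = [make_example(n, rng) for n in range(1, 11)]
    for example in dataset:
        print(example["answer"], len(example["circles"]))
```

Because the object count is chosen before the scene is generated, the label is exact by construction, which is what makes this kind of data cheap to produce at scale.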

Citations: 0, Influential: 0, Altmetric: 4.5, Score: 22.5

