2602.01893v1 Feb 02, 2026 cs.AI

Geometric Analysis of Token Selection in Multi-Head Attention

Timur Mudarisov

Citations: 3

h-index: 1

Mikhail Burtsev

Citations: 0

h-index: 0

Tatiana Petrova

Citations: 10

h-index: 2

Radu State

Citations: 51

h-index: 3

Abstract

We present a geometric framework for analysing multi-head attention in large language models (LLMs). Without altering the mechanism, we view standard attention through a top-N selection lens and study its behaviour directly in value-state space. We define three geometric metrics (Precision, Recall, and F-score) to quantify separability between selected and non-selected tokens, and derive non-asymptotic bounds with explicit dependence on dimension and margin under empirically motivated assumptions: stable value norms with a compressed sink token, exponential similarity decay, and piecewise attention weight profiles. The theory predicts a small-N operating regime of strongest non-trivial separability and clarifies how sequence length and sink similarity shape the metrics. Empirically, across LLaMA-2-7B, Gemma-7B, and Mistral-7B, measurements closely track the theoretical envelopes: top-N selection sharpens separability, and sink similarity correlates with Recall. We also find that in LLaMA-2-7B heads specialize into three regimes (Retriever, Mixer, and Reset) with distinct geometric signatures. Overall, attention behaves as a structured geometric classifier with measurable criteria for token selection, offering head-level interpretability and informing geometry-aware sparsification and attention design in LLMs.
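The abstract does not spell out how the geometric metrics are computed, so the following is only an illustrative sketch under stated assumptions, not the authors' definition. It treats the N tokens with the largest attention weights as the "selected" set, takes the attention output (the weighted mean of the value vectors) as a centre in value-state space, and counts a token as geometrically "captured" if its value vector lies within a hypothetical radius of that centre. Precision, Recall, and F-score then compare the captured set with the selected set. The function names, the ball-based criterion, and the `radius` parameter are all assumptions introduced for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_indices(weights, n):
    """Indices of the n largest attention weights (the 'selected' tokens)."""
    order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return set(order[:n])


def attention_output(weights, values):
    """Weighted mean of the value vectors, i.e. the head's output for one query."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def geometric_prf(weights, values, n, radius):
    """Precision, Recall, F-score of a ball around the output vs. the top-n set.

    ASSUMPTION: the ball-of-given-radius criterion is a stand-in chosen for
    this sketch; the paper's actual metric definitions may differ.
    """
    selected = top_n_indices(weights, n)
    center = attention_output(weights, values)
    captured = {i for i, v in enumerate(values) if math.dist(v, center) <= radius}
    true_pos = len(selected & captured)
    precision = true_pos / len(captured) if captured else 0.0
    recall = true_pos / len(selected) if selected else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f_score


if __name__ == "__main__":
    # Toy head: two tokens dominate the attention and have nearby value vectors.
    logits = [4.0, 3.5, 0.0, 0.0, 0.0]
    values = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [0.0, -1.0], [-0.5, -0.5]]
    weights = softmax(logits)
    for radius in (0.5, 1.44):
        print(radius, geometric_prf(weights, values, n=2, radius=radius))
```

In this toy setting a tight radius captures exactly the two high-weight tokens, while a looser radius also pulls in a non-selected token and lowers Precision, which is the kind of separability trade-off the metrics are meant to quantify.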
