2602.11534v1 Feb 12, 2026 cs.LG

Krause Synchronization Transformers

Yisong Yue

Max Welling

Yue Song

Jingkun Liu

Original Abstract

Self-attention in Transformers relies on globally normalized softmax weights, causing all tokens to compete for influence at every layer. When composed across depth, this interaction pattern induces strong synchronization dynamics that favor convergence toward a dominant mode, a behavior associated with representation collapse and attention sink phenomena. We introduce Krause Attention, a principled attention mechanism inspired by bounded-confidence consensus dynamics. Krause Attention replaces similarity-based global aggregation with distance-based, localized, and selectively sparse interactions, promoting structured local synchronization instead of global mixing. We relate this behavior to recent theory modeling Transformer dynamics as interacting particle systems, and show how bounded-confidence interactions naturally moderate attention concentration and alleviate attention sinks. Restricting interactions to local neighborhoods also reduces runtime complexity from quadratic to linear in sequence length. Experiments across vision (ViT on CIFAR/ImageNet), autoregressive generation (MNIST/CIFAR-10), and large language models (Llama/Qwen) demonstrate consistent gains with substantially reduced computation, highlighting bounded-confidence dynamics as a scalable and effective inductive bias for attention.
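
The abstract describes the mechanism only in words, so the following is a minimal sketch of what a bounded-confidence attention step could look like, not the authors' formulation. It assumes three things that the abstract does not specify: a fixed positional window, a hard distance threshold in the style of Hegselmann-Krause consensus models, and a Gaussian-like weight on squared Euclidean distance. The names `bounded_confidence_attention`, `window`, `radius`, and `temperature` are hypothetical.

```python
import numpy as np


def bounded_confidence_attention(q, k, v, window=2, radius=1.5, temperature=1.0):
    """Illustrative distance-based local attention (assumed form, not from the paper).

    q, k: arrays of shape (n, d); v: array of shape (n, d_v).
    Token i only looks at positions within `window` of i, and among those
    only at keys whose Euclidean distance to q[i] is at most `radius`.
    """
    n = q.shape[0]
    out = np.zeros((n, v.shape[1]), dtype=float)
    for i in range(n):
        # Local neighborhood: at most 2 * window + 1 positions, so the total
        # cost grows linearly with n for a fixed window.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        d2 = np.sum((k[lo:hi] - q[i]) ** 2, axis=1)

        # Bounded confidence: ignore neighbors farther away than `radius`.
        mask = d2 <= radius ** 2
        mask[i - lo] = True  # always keep the token itself so the row is never empty

        # Distance-based weights; shifting by the minimum keeps exp() stable.
        d2_kept = np.where(mask, d2, np.inf)
        w = np.exp(-(d2_kept - d2_kept.min()) / temperature)

        # Normalize over the surviving neighbors only, not over all tokens.
        out[i] = (w / w.sum()) @ v[lo:hi]
    return out
```

Under these assumptions, tokens that sit in well-separated clusters average only within their own cluster, which is the kind of structured local synchronization the abstract contrasts with global mixing. Because each row is normalized over a bounded neighborhood instead of the whole sequence, no single token can collect weight from every position.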
