2602.12128v1 Feb 12, 2026 cs.AI

HLA: Hadamard Linear Attention

Hanno Ackermann (Citations: 9, h-index: 1)

Mohsen Ghafoorian (Citations: 26, h-index: 3)

A. Habibian (Citations: 1,285, h-index: 18)

Hong Cai (Citations: 140, h-index: 6)

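The attention mechanism is a key factor in the success of transformers. It relies on computing pairwise relations between tokens. To reduce the high computational cost of standard quadratic attention, linear attention has been proposed as an efficient approximation. Linear attention applies kernel functions to the inputs independently, before the pairwise similarities are computed. This enables efficient computation, but the result is only a low-degree rational function approximating softmax. We therefore propose Hadamard Linear Attention (HLA). Unlike prior work on linear attention, HLA does not apply its nonlinearity to queries and keys separately; as in standard softmax attention, it applies the nonlinearity after the pairwise similarities have been computed. The paper shows that the proposed nonlinearity corresponds to a higher-degree rational function approximating softmax, and derives an efficient computational scheme similar to that of standard linear attention. Unlike other approaches, the proposed algorithm requires no time-consuming tensor reshaping. The approach is validated on a large diffusion transformer model for video generation, which processes a very large number of tokens.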

Original Abstract

The attention mechanism is an important reason for the success of transformers. It relies on computing pairwise relations between tokens. To reduce the high computational cost of standard quadratic attention, linear attention has been proposed as an efficient approximation. It employs kernel functions that are applied independently to the inputs before the pairwise similarities are calculated. That allows for an efficient computational procedure which, however, amounts to a low-degree rational function approximating softmax. We propose Hadamard Linear Attention (HLA). Unlike previous works on linear attention, the nonlinearity in HLA is not applied separately to queries and keys, but, analogously to standard softmax attention, after the pairwise similarities have been computed. It will be shown that the proposed nonlinearity amounts to a higher-degree rational function to approximate softmax. An efficient computational scheme for the proposed method is derived that is similar to that of standard linear attention. In contrast to other approaches, no time-consuming tensor reshaping is necessary to apply the proposed algorithm. The effectiveness of the approach is demonstrated by applying it to a large diffusion transformer model for video generation, an application that involves very large amounts of tokens.
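The abstract's description of standard linear attention (kernel functions applied independently to queries and keys before the pairwise similarities are computed) can be illustrated with a minimal NumPy sketch. This is the well-known baseline the abstract contrasts against, not the HLA algorithm itself, whose details the abstract does not give. The feature map `elu(x) + 1`, the function names, and the tensor sizes are assumptions chosen for illustration.

```python
import numpy as np

def feature_map(x):
    # Assumed kernel feature map: elu(x) + 1, which is strictly positive.
    # np.minimum guards exp() against overflow in the branch np.where discards.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_form(q, k, v):
    # Materializes the full N x N similarity matrix: O(N^2 * d) time.
    sim = feature_map(q) @ feature_map(k).T
    return (sim @ v) / sim.sum(axis=1, keepdims=True)

def linear_form(q, k, v):
    # Reassociates the same product as phi(Q) @ (phi(K).T @ V),
    # so no N x N matrix is ever formed: O(N * d^2) time.
    fq, fk = feature_map(q), feature_map(k)
    kv = fk.T @ v           # shape (d, d_v), summed over all keys
    z = fk.sum(axis=0)      # shape (d,), normalizer summed over all keys
    return (fq @ kv) / (fq @ z)[:, None]

# Hypothetical sizes: N = 64 tokens, d = 8 channels.
rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = rng.standard_normal((3, n, d))

out_quadratic = quadratic_form(q, k, v)
out_linear = linear_form(q, k, v)
max_abs_diff = float(np.abs(out_quadratic - out_linear).max())
```

Both functions compute the same output up to floating-point error. The linear form is only possible because the nonlinearity acts on queries and keys separately, which is exactly the restriction the abstract says HLA relaxes by applying its nonlinearity after the pairwise similarities.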
