2602.14452v1 Feb 16, 2026 cs.LG

WiSparse: Boosting LLM Inference Efficiency with Weight-Aware Mixed Activation Sparsity

Lei Chen (citations: 70, h-index: 2)
Yuan Meng (citations: 0, h-index: 0)
Xiaoyu Zhan (citations: 31, h-index: 4)
Zhi Wang (citations: 156, h-index: 7)
Wenwu Zhu (citations: 19,904, h-index: 61)

Abstract

Large Language Models (LLMs) offer strong capabilities but incur high inference costs due to dense computation and memory access. Training-free activation sparsity is a promising approach for efficient LLM inference, yet existing methods often rely solely on activation information and uniform sparsity ratios. This overlooks the critical interplay with weights and inter-block sensitivity variation, leading to suboptimal performance. We identify two key phenomena in modern LLMs: 1) less significant activations may align with highly important weights, and 2) sparsity sensitivity varies non-monotonically across model blocks. We propose Weight-aware Mixed-Granularity Training-free Activation Sparsity (WiSparse), which leverages both activation and weight information for adaptive sparsity allocation. Specifically, we introduce a weight-aware mechanism integrating activation magnitudes with precomputed weight norms to accurately identify salient channels. This is combined with a mixed-granularity allocation scheme: a global budget is distributed across blocks via evolutionary search to protect sensitive regions, then refined within blocks to minimize reconstruction error. We improve sparse kernels and demonstrate effectiveness on three representative models. Notably, at 50% sparsity, WiSparse preserves 97% of Llama3.1's dense performance, surpassing the strongest baseline by 2.23 percentage points while achieving a 21.4% acceleration in end-to-end inference speed. Our research advances the limits of training-free approaches for efficient LLM inference, pushing the boundaries of achievable speedup without training.
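The weight-aware idea in the abstract can be illustrated with a small sketch. The abstract only says that activation magnitudes are integrated with precomputed weight norms; the specific score used here, |x_j| multiplied by the L2 norm of weight column j, is an assumption made for illustration, as are the helper names (`column_norms`, `topk_mask`, `sparse_matvec`) and the toy numbers. The sketch does not cover the evolutionary budget search or the sparse kernels.

```python
import math

def column_norms(W):
    """L2 norm of each input channel (column) of W; can be computed once offline."""
    n_in = len(W[0])
    return [math.sqrt(sum(row[j] ** 2 for row in W)) for j in range(n_in)]

def topk_mask(scores, sparsity):
    """Boolean mask that drops a `sparsity` fraction of the lowest-scoring channels."""
    n_keep = len(scores) - int(sparsity * len(scores))
    keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n_keep]
    mask = [False] * len(scores)
    for j in keep:
        mask[j] = True
    return mask

def sparse_matvec(W, x, mask):
    """Compute W @ x using only the channels where mask is True."""
    return [sum(row[j] * x[j] for j in range(len(x)) if mask[j]) for row in W]

def l2_error(a, b):
    """Euclidean distance between two output vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Toy layer with 2 outputs and 4 input channels (hypothetical values).
# Channel 3 has the smallest activation but by far the largest weight column.
W = [[0.1, 0.2, 0.1, 5.0],
     [0.2, 0.1, 0.1, 4.0]]
x = [1.0, 0.9, 0.8, 0.3]

norms = column_norms(W)

# Activation-only scoring: rank channels by |x_j|.
act_scores = [abs(v) for v in x]
# Weight-aware scoring (assumed form): rank channels by |x_j| * ||W[:, j]||_2.
wa_scores = [abs(v) * n for v, n in zip(x, norms)]

dense = sparse_matvec(W, x, [True] * len(x))
act_mask = topk_mask(act_scores, 0.5)
wa_mask = topk_mask(wa_scores, 0.5)

# Output error of each 50%-sparse variant relative to the dense result.
act_err = l2_error(dense, sparse_matvec(W, x, act_mask))
wa_err = l2_error(dense, sparse_matvec(W, x, wa_mask))
```

In this toy case the activation-only mask drops channel 3 because its activation is the smallest, while the weight-aware mask keeps it, and the weight-aware output ends up closer to the dense output. That mirrors the first phenomenon the abstract describes, though a hand-built example like this says nothing about behavior on real models.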

Paper metrics: 0 citations, 0 influential, Altmetric 30, score 150.0
