2603.11873v1 Mar 12, 2026 cs.AI

AdaFuse: Accelerating Dynamic Adapter Inference via Token-Level Pre-Gating and Fused Kernel Optimization

Shuaiqiang Wang

Citations: 2,668

h-index: 21

Dawei Yin

Citations: 1,554

h-index: 19

Yuchen Li

Citations: 71

h-index: 5

Hengyi Cai

Citations: 531

h-index: 10

Qiyang Li

Citations: 23

h-index: 2

Rui Kong

Citations: 394

h-index: 4

Linghe Kong

Citations: 40

h-index: 3

Guihai Chen

Citations: 769

h-index: 14

Abstract

The integration of dynamic, sparse structures like Mixture-of-Experts (MoE) with parameter-efficient adapters (e.g., LoRA) is a powerful technique for enhancing Large Language Models (LLMs). However, this architectural enhancement comes at a steep cost: despite minimal increases in computational load, the inference latency often skyrockets, leading to decoding speeds slowing by over 2.5 times. Through a fine-grained performance analysis, we pinpoint the primary bottleneck not in the computation itself, but in the severe overhead from fragmented, sequential CUDA kernel launches required for conventional dynamic routing. To address this challenge, we introduce AdaFuse, a framework built on a tight co-design between the algorithm and the underlying hardware system to enable efficient dynamic adapter execution. Departing from conventional layer-wise or block-wise routing, AdaFuse employs a token-level pre-gating strategy, which makes a single, global routing decision for all adapter layers before a token is processed. This "decide-once, apply-everywhere" approach effectively staticizes the execution path for each token, creating an opportunity for holistic optimization. We capitalize on this by developing a custom CUDA kernel that performs a fused switching operation, merging the parameters of all selected LoRA adapters into the backbone model in a single, efficient pass. Experimental results on popular open-source LLMs show that AdaFuse achieves accuracy on par with state-of-the-art dynamic adapters while drastically cutting decoding latency by a factor of over 2.4x, thereby bridging the gap between model capability and inference efficiency.
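The "decide-once, apply-everywhere" idea can be illustrated with a small CPU-only sketch. This is not the authors' CUDA kernel or their router: the toy linear layers, the top-k softmax gate, and every name below (`pre_gate`, `merge_adapters`, `forward_unfused`, `forward_merged`, `demo`) are assumptions made for illustration. The sketch only shows that once a token's routing is fixed for all layers, the selected LoRA updates can be folded into the backbone weights ahead of time, so the forward pass needs one matrix-vector product per layer instead of one plus two per selected adapter.

```python
import math
import random

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def matmul(A, B):
    """Multiply matrix A (n x k) by matrix B (k x m)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def rand_matrix(rng, rows, cols, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def build_toy_model(rng, n_layers, n_experts, d, r):
    """Hypothetical toy model: linear layers, each with one LoRA pair per expert."""
    layers = []
    for _ in range(n_layers):
        # A is (r x d), B is (d x r), so B @ A has the same shape as W (d x d).
        lora = [(rand_matrix(rng, r, d), rand_matrix(rng, d, r)) for _ in range(n_experts)]
        layers.append({"W": rand_matrix(rng, d, d), "lora": lora})
    router = rand_matrix(rng, n_experts, d, scale=1.0)
    return layers, router

def pre_gate(router, x, top_k):
    """Assumed gate: one routing decision per token, reused by every layer."""
    scores = matvec(router, x)
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    peak = max(scores[e] for e in chosen)
    exps = {e: math.exp(scores[e] - peak) for e in chosen}
    total = sum(exps.values())
    return {e: w / total for e, w in exps.items()}  # softmax over the chosen experts

def forward_unfused(layers, gates, x):
    """Baseline: separate low-rank path per selected expert in every layer."""
    ops = 0  # matrix-vector products, a crude stand-in for kernel launches
    for layer in layers:
        y = matvec(layer["W"], x)
        ops += 1
        for e, g in gates.items():
            A, B = layer["lora"][e]
            delta = matvec(B, matvec(A, x))
            ops += 2
            y = [yi + g * di for yi, di in zip(y, delta)]
        x = y
    return x, ops

def merge_adapters(layers, gates):
    """Fold the gated LoRA updates into each layer: W' = W + sum_e g_e * B_e A_e."""
    merged = []
    for layer in layers:
        W = [row[:] for row in layer["W"]]
        for e, g in gates.items():
            A, B = layer["lora"][e]
            BA = matmul(B, A)
            for i, row in enumerate(BA):
                for j, value in enumerate(row):
                    W[i][j] += g * value
        merged.append(W)
    return merged

def forward_merged(merged, x):
    """After merging, each layer is a single matrix-vector product."""
    ops = 0
    for W in merged:
        x = matvec(W, x)
        ops += 1
    return x, ops

def demo(seed=0, n_layers=4, n_experts=8, d=6, r=2, top_k=2):
    rng = random.Random(seed)
    layers, router = build_toy_model(rng, n_layers, n_experts, d, r)
    token = [rng.uniform(-1, 1) for _ in range(d)]
    gates = pre_gate(router, token, top_k)  # decided once, before any layer runs
    y_ref, ops_ref = forward_unfused(layers, gates, token)
    y_fused, ops_fused = forward_merged(merge_adapters(layers, gates), token)
    max_err = max(abs(a - b) for a, b in zip(y_ref, y_fused))
    return gates, max_err, ops_ref, ops_fused
```

With the default settings (4 layers, 2 selected experts), `demo()` reports 20 matrix-vector products for the unfused path and 4 for the merged path, with outputs that agree up to floating-point rounding. The operation count is only a proxy: it ignores the cost of the merge itself, which in the paper is handled by a custom fused CUDA kernel, and it says nothing about real GPU latency.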
