2601.17917v2 Jan 25, 2026 cs.LG

Streaming-dLLM: Accelerating Diffusion LLMs via Suffix Pruning and Dynamic Decoding

Zhongyu Xiao (citations: 18, h-index: 2)
Zhiwei Hao (citations: 511, h-index: 12)
Jianyuan Guo (citations: 36, h-index: 3)
Yong Luo (citations: 36, h-index: 4)
Jia Liu (citations: 75, h-index: 4)
Jie Xu (citations: 80, h-index: 5)
Han Hu (citations: 222, h-index: 5)

Original Abstract

Diffusion Large Language Models (dLLMs) offer a compelling paradigm for natural language generation, leveraging parallel decoding and bidirectional attention to achieve superior global coherence compared to autoregressive models. While recent works have accelerated inference via KV cache reuse or heuristic decoding, they overlook the intrinsic inefficiencies within the block-wise diffusion process. Specifically, they suffer from spatial redundancy, because information-sparse suffix regions are modeled uniformly, and from temporal inefficiency, because a fixed denoising schedule is applied across the entire decoding process. To address this, we propose Streaming-dLLM, a training-free framework that streamlines inference across both spatial and temporal dimensions. Spatially, we introduce attenuation-guided suffix modeling to approximate the full context by pruning redundant mask tokens. Temporally, we employ a dynamic confidence-aware strategy with an early exit mechanism, allowing the model to skip unnecessary iterations for converged tokens. Extensive experiments show that Streaming-dLLM achieves up to 68.2× speedup while maintaining generation quality, highlighting its effectiveness in diffusion decoding. The code is available at https://github.com/xiaoshideta/Streaming-dLLM.
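The abstract names two mechanisms without giving their details, so the following is only a toy sketch of how they might fit together, not the authors' implementation. Every name in it (`keep_positions`, `decode_block`, `window`, `start_threshold`, `decay`) is hypothetical, and the specific rules are assumptions: the suffix pruning simply keeps the masked positions nearest the current block, and the "dynamic" confidence threshold is modeled as a linear decay per step.

```python
from typing import Callable, List, Optional, Tuple

MASK = None  # placeholder for a position that is still masked
Prediction = Tuple[int, float]  # (token id, confidence in [0, 1])


def keep_positions(seq_len: int, block_end: int, window: int) -> List[int]:
    """Indices kept after suffix pruning (assumed rule, not from the paper).

    Everything up to the end of the current block is kept. Of the masked
    suffix that follows, only the `window` nearest positions survive.
    """
    cutoff = min(seq_len, block_end + window)
    return list(range(cutoff))


def decode_block(
    predict: Callable[[List[Optional[int]]], List[Prediction]],
    block_len: int,
    max_steps: int,
    start_threshold: float = 0.9,
    decay: float = 0.1,
) -> Tuple[List[Optional[int]], int]:
    """Denoise one block, unmasking confident tokens in parallel.

    Returns the decoded tokens and the number of denoising steps used.
    """
    tokens: List[Optional[int]] = [MASK] * block_len
    steps_used = 0
    for step in range(max_steps):
        masked = [i for i, t in enumerate(tokens) if t is MASK]
        if not masked:
            break  # early exit: the block has converged
        preds = predict(tokens)
        # Assumed schedule: the acceptance threshold relaxes each step.
        threshold = max(0.0, start_threshold - decay * step)
        accepted = [i for i in masked if preds[i][1] >= threshold]
        if not accepted:
            # Guarantee progress by taking the single most confident token.
            accepted = [max(masked, key=lambda i: preds[i][1])]
        for i in accepted:
            tokens[i] = preds[i][0]
        steps_used += 1
    return tokens, steps_used
```

Here `predict` stands in for one forward pass of the model over the pruned context. With a fixed schedule the loop would always run `max_steps` times; in this sketch it stops as soon as no masked position remains, which is the kind of saving the abstract attributes to the early exit mechanism.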

Citations: 2 (0 influential)
Altmetric: 40.17
Score: 202.8
