2604.15750v1 Apr 17, 2026 cs.LG

DepCap: Adaptive Block-Wise Parallel Decoding for Efficient Diffusion LM Inference

Wuyang Zhang (Citations: 39, h-index: 3)

Yanyong Zhang (Citations: 20, h-index: 3)

Xiangwen Xia (Citations: 0, h-index: 0)

Jiazhen Liu (Citations: 11, h-index: 3)

Chen Yan (Citations: 10, h-index: 2)

Abstract

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive language generation due to their potential for parallel decoding and global refinement of the entire sequence. To unlock this potential, DLM inference must carefully balance generation quality and decoding speed. Recent block-wise DLM decoding methods improve this trade-off by performing diffusion-based decoding sequentially in blocks. However, existing methods typically rely on fixed block schedules or current-step local signals to determine block boundaries, and use conservative confidence-based parallel decoding to avoid conflicts, limiting the quality-speed trade-off. In this paper, we argue that block-wise DLM inference requires more suitable signals for its two core decisions: cross-step signals for determining block boundaries, and token-level conflict signals for parallel decoding. Based on this view, we propose DepCap, a training-free framework for efficient block-wise DLM inference. Specifically, DepCap instantiates the cross-step signal as the influence of the last decoded block and uses it to adaptively determine how far the next block should extend, while identifying a conflict-free subset of tokens for safe parallel decoding within each block, enabling substantial inference acceleration with negligible quality degradation. DepCap is a plug-and-play method applicable to various DLMs, and compatible with existing KV-cache strategies for block-wise DLM. An information-theoretic analysis further suggests that the cumulative last-block influence on a candidate block is approximately additive across tokens, supporting the proposed block-partitioning criterion. Experimental results show that DepCap achieves favorable speed-quality trade-offs across multiple DLM backbones and reasoning and coding benchmarks, with up to 5.63$\times$ speedup without significant performance degradation.
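
The two decisions named in the abstract (how far the next block should extend, and which tokens in a block can be decoded in parallel) can be illustrated with a small sketch. It is not the paper's algorithm: the abstract does not give the influence measure, the stopping rule, or the conflict test. The names `choose_block_length`, `conflict_free_subset`, `influence`, `budget`, and `conflicts` are all hypothetical, as is the choice of a greedy, confidence-ordered selection. The only property taken from the abstract is that last-block influence is approximately additive across tokens, which is why a running sum of per-token scores stands in for the cumulative influence on a candidate block.

```python
from typing import List, Sequence, Set, Tuple


def choose_block_length(influence: Sequence[float], budget: float,
                        min_len: int = 1) -> int:
    """Return how many upcoming positions to place in the next block.

    influence[i] is a hypothetical per-token score for how strongly the last
    decoded block influences the i-th position after it. Assuming approximate
    additivity, the cumulative influence on a candidate block is a running sum.
    The stopping rule (stop before the sum would exceed `budget`) is an
    illustrative assumption, not the criterion from the paper.
    """
    total = 0.0
    length = 0
    for score in influence:
        # Always take at least min_len positions so decoding makes progress.
        if length >= min_len and total + score > budget:
            break
        total += score
        length += 1
    return length


def conflict_free_subset(confidence: Sequence[float],
                         conflicts: Set[Tuple[int, int]]) -> List[int]:
    """Greedily select positions that can be decoded in the same step.

    `conflicts` holds pairs (i, j) with i < j that are assumed unsafe to
    decode together; how such pairs are detected is outside this sketch.
    Positions are visited from most to least confident, and a position is
    kept only if it conflicts with none of the positions already kept.
    """
    order = sorted(range(len(confidence)), key=lambda i: -confidence[i])
    chosen: List[int] = []
    for i in order:
        if all((min(i, j), max(i, j)) not in conflicts for j in chosen):
            chosen.append(i)
    return sorted(chosen)


if __name__ == "__main__":
    # Toy numbers, made up for illustration only.
    block_len = choose_block_length([0.5, 0.3, 0.4, 0.1], budget=1.0)
    parallel = conflict_free_subset([0.9, 0.8, 0.7, 0.6], {(0, 1), (2, 3)})
    print(block_len, parallel)
```

A greedy pass keeps the sketch short and deterministic. It does not guarantee the largest possible conflict-free set, which would be a maximum independent set problem on the conflict graph.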
