2605.28302v1 May 27, 2026 cs.LG

How Far Can Disaggregation Go? A Design-Space Exploration of Attention-FFN Disaggregation for Efficient MoE LLM Serving

Suvinay Subramanian
Sarbartha Banerjee
A. Bambhaniya
Tuhin Khare
S. Srinivasan
Souvik Kundu
Midhilesh Elavazhagan
William Won
Amir Yazdanbakhsh
Tushar Krishna
Hanjiang Wu
Madhu Kumar

Modern large language model (LLM) inference has progressively disaggregated to keep pace with growing model sizes and tight time-to-first-token (TTFT) and time-per-output-token (TPOT) service-level objectives (SLOs): from chunked-prefill aggregation, to prefill-decode (P/D) disaggregation, and most recently to operator-level Attention-FFN Disaggregation (AFD). This trend is especially important for mixture-of-experts (MoE) models, where memory-bound attention, compute-intensive expert FFNs, and MoE dispatch/combine communication create distinct resource demands. AFD further exposes this heterogeneity by placing attention and MoE-FFN execution on separate GPU groups. Each level of disaggregation enlarges the scheduling design space across workload characteristics, resource allocation, and interconnect topology, raising the central question: when does each level actually pay off? We systematically characterize this trade-off for MoE inference across realistic workloads spanning input/output sequence lengths, prefix-KV reuse, and per-user latency constraints. Using chunked-prefill and P/D disaggregation as baselines, we study the benefits and limits of AFD at scale through a framework that combines on-device kernel measurements with high-fidelity network simulation. Under strict TTFT/TPOT SLOs, AFD sustains around 4k tokens/s of system throughput on DeepSeek-V3.2 across chat, coding, and agentic-coding workloads, where non-AFD deployments are infeasible. We distill concrete takeaways for jointly optimizing throughput and interactivity, including how to partition attention and FFN across GPUs as a function of workload and model architecture, providing design principles for current rack- and cluster-scale deployments as well as future disaggregated AI infrastructure.
