2605.28207v1 May 27, 2026 cs.CL

Pruning and Distilling Mixture-of-Experts into Dense Language Models

Gyeongman Kim
Jihun Yun
Haechan Kim
Junhyuck Kim
J. Bae
Jaewoong Cho

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, which makes it a poor fit for memory-constrained deployment. Existing compression methods reduce the number of experts, but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard, fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude-scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method has the largest impact, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation, while training 1.6x faster in wall-clock time.
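
The select-and-concatenate step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes simple ReLU experts (real MoE experts are often gated), uses random placeholder scores in place of any of the paper's scoring methods, and omits grouping, magnitude scaling, and distillation. The names `expert_ffn` and `concat_experts` are hypothetical. The sketch shows only one property: concatenating the selected experts along the hidden dimension gives a dense FFN whose output equals the unweighted sum of those experts' outputs.

```python
import numpy as np

def expert_ffn(x, w_up, w_down):
    """One FFN: ReLU(x @ w_up) @ w_down (ReLU is an assumption for illustration)."""
    return np.maximum(x @ w_up, 0.0) @ w_down

def concat_experts(w_ups, w_downs, scores, k):
    """Keep the k highest-scoring experts and concatenate them along the hidden axis."""
    keep = np.argsort(scores)[::-1][:k]
    w_up = np.concatenate([w_ups[i] for i in keep], axis=1)      # shape (d, k*h)
    w_down = np.concatenate([w_downs[i] for i in keep], axis=0)  # shape (k*h, d)
    return keep, w_up, w_down

rng = np.random.default_rng(0)
d, h, n_experts, k = 8, 16, 6, 3           # toy sizes, not taken from the paper
w_ups = rng.normal(size=(n_experts, d, h))
w_downs = rng.normal(size=(n_experts, h, d))
scores = rng.random(n_experts)             # placeholder for a real scoring method
x = rng.normal(size=(4, d))                # a batch of 4 token vectors

keep, w_up, w_down = concat_experts(w_ups, w_downs, scores, k)
dense_out = expert_ffn(x, w_up, w_down)
sum_out = sum(expert_ffn(x, w_ups[i], w_downs[i]) for i in keep)

# The dense FFN reproduces the unweighted sum of the selected experts.
print(np.allclose(dense_out, sum_out))
```

Because the dense layer drops the router's per-token weights and the unselected experts, its output generally differs from the original MoE layer, which is presumably why the framework follows concatenation with magnitude scaling and distillation from the MoE teacher.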
