2602.14301v1 Feb 15, 2026 cs.LG

DeepFusion: Accelerating MoE Training via Federated Knowledge Distillation from Heterogeneous Edge Devices

Songyuan Li

Jia Hu

A. M. Abdelmoniem

Geyong Min

Haojun Huang

Jiwei Huang

Abstract

Recent Mixture-of-Experts (MoE)-based large language models (LLMs) such as Qwen-MoE and DeepSeek-MoE are transforming generative AI in natural language processing. However, these models require vast and diverse training data. Federated learning (FL) addresses this challenge by leveraging private data from heterogeneous edge devices for privacy-preserving MoE training. Nonetheless, traditional FL approaches require devices to host local MoE models, which is impractical for resource-constrained devices due to large model sizes. To address this, we propose DeepFusion, the first scalable federated MoE training framework that enables the fusion of heterogeneous on-device LLM knowledge via federated knowledge distillation, yielding a knowledge-abundant global MoE model. Specifically, DeepFusion lets each device independently configure and train an on-device LLM tailored to its own needs and hardware limitations. Furthermore, we propose a novel View-Aligned Attention (VAA) module that integrates multi-stage feature representations from the global MoE model to construct a predictive perspective aligned with on-device LLMs, thereby enabling effective cross-architecture knowledge distillation. By explicitly aligning predictive perspectives, VAA resolves the view-mismatch problem in traditional federated knowledge distillation, which arises from heterogeneity in model architectures and prediction behaviors between on-device LLMs and the global MoE model. Experiments with industry-level MoE models (Qwen-MoE and DeepSeek-MoE) and real-world datasets (medical and finance) demonstrate that DeepFusion achieves performance close to centralized MoE training. Compared with key federated MoE baselines, DeepFusion reduces communication costs by up to 71% and improves token perplexity by up to 5.28%.
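
To make the two ideas in the abstract concrete, the following is a minimal sketch, not the paper's implementation. It assumes an on-device LLM acts as the teacher and the global MoE model as the student, and it uses a standard temperature-scaled KL distillation loss. The function `attention_pool` is a hypothetical stand-in for VAA: it only shows generic dot-product attention over per-stage feature vectors, since the abstract does not specify how VAA is built. All names, shapes, and values here are illustrative assumptions.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over a list of floats."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum for x in scaled]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Scaled by temperature**2, as is common in knowledge distillation.
    Assumption: teacher = on-device LLM, student = global MoE model.
    """
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return kl * temperature ** 2


def attention_pool(stage_features, query):
    """Fuse per-stage feature vectors with dot-product attention weights.

    Hypothetical illustration only: the real VAA module is not described
    in enough detail in the abstract to reproduce it.
    """
    dim = len(query)
    scores = [
        sum(f * q for f, q in zip(feat, query)) / math.sqrt(dim)
        for feat in stage_features
    ]
    weights = [math.exp(s) for s in log_softmax(scores)]
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, stage_features))
        for i in range(dim)
    ]
    return fused, weights


# Toy usage with made-up numbers: three "stages" of 2-d features from the
# student, fused into one view, plus a distillation loss on 3-way logits.
fused_view, stage_weights = attention_pool(
    [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]], query=[1.0, 0.0]
)
loss = kd_loss(teacher_logits=[2.0, 0.5, -1.0], student_logits=[1.0, 1.0, 0.0])
print(fused_view, stage_weights, loss)
```

In this sketch the loss is zero when the student's logits match the teacher's and grows as the two softened distributions diverge. The attention weights show one simple way that several intermediate representations could be combined into a single view before comparison.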
