2604.25421v1 Apr 28, 2026 cs.LG

FED-FSTQ: Fisher-Guided Token Quantization for Communication-Efficient Federated Fine-Tuning of LLMs on Edge Devices

Lu Wang (Citations: 1,684; h-index: 20)
Ming Lei (Citations: 75; h-index: 3)
Kaishun Wu (Citations: 86; h-index: 6)
Fei Luo (Citations: 2; h-index: 1)
Shuanghong Huang (Citations: 7; h-index: 1)
Jiduo Xing (Citations: 108; h-index: 6)
Chang Li (Citations: 21; h-index: 1)
Jiasheng Liu (Citations: 13; h-index: 1)

Abstract

Federated fine-tuning provides a practical route to adapt large language models (LLMs) on edge devices without centralizing private data, yet in mobile deployments the training wall-clock is often bottlenecked by straggler-limited uplink communication under heterogeneous bandwidth and intermittent participation. Although parameter-efficient fine-tuning (PEFT) reduces trainable parameters, per-round payloads remain prohibitive in non-IID regimes, where uniform compression can discard rare but task-critical signals. We propose Fed-FSTQ, a Fisher-guided token quantization system primitive for communication-efficient federated LLM fine-tuning. Fed-FSTQ employs a lightweight Fisher proxy to estimate token sensitivity, coupling importance-aware token selection with non-uniform mixed-precision quantization to allocate higher fidelity to informative evidence while suppressing redundant transmission. The method is model-agnostic, serves as a drop-in module for standard federated PEFT pipelines, e.g., LoRA, without modifying the server aggregation rule, and supports bandwidth-heterogeneous clients via compact sparse message packing. Experiments on multilingual QA and medical QA under non-IID partitions show that Fed-FSTQ reduces cumulative uplink traffic required to reach a fixed quality threshold by 46x relative to a standard LoRA baseline, and improves end-to-end wall-clock time-to-accuracy by 52%. Furthermore, enabling Fisher-guided token reduction at inference yields up to a 1.55x end-to-end speedup on NVIDIA Jetson-class edge devices, demonstrating deployability under tight resource constraints.
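The abstract describes the client-side pipeline only at a high level: score tokens with a Fisher proxy, keep the important ones, and quantize them at non-uniform precision before packing a sparse message. The sketch below is one possible reading of that description, not the authors' implementation. It assumes the Fisher proxy is the squared L2 norm of a per-token gradient, and the names `fisher_proxy`, `quantize`, `compress`, `keep_ratio`, `hi_fraction`, `hi_bits`, and `lo_bits` are all hypothetical, as are the 8-bit and 4-bit widths.

```python
import math


def fisher_proxy(token_grads):
    """Assumed proxy: squared L2 norm of each token's gradient row."""
    return [sum(g * g for g in row) for row in token_grads]


def quantize(values, bits):
    """Uniform symmetric quantization; returns the dequantized values."""
    levels = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / levels
    if scale == 0:
        return list(values)  # all-zero vector: nothing to quantize
    return [round(v / scale) * scale for v in values]


def compress(token_vectors, token_grads, keep_ratio=0.5,
             hi_fraction=0.5, hi_bits=8, lo_bits=4):
    """Build a sparse message: {token_index: (bit_width, quantized_vector)}.

    Tokens are ranked by the Fisher proxy. Only the top `keep_ratio` share
    is sent; within that share, the top `hi_fraction` gets `hi_bits` and
    the remainder gets `lo_bits`. All thresholds are illustrative.
    """
    scores = fisher_proxy(token_grads)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_keep = max(1, math.ceil(keep_ratio * len(ranked)))
    n_hi = max(1, math.ceil(hi_fraction * n_keep))
    message = {}
    for rank, idx in enumerate(ranked[:n_keep]):
        bits = hi_bits if rank < n_hi else lo_bits
        message[idx] = (bits, quantize(token_vectors[idx], bits))
    return message


# Toy example: four tokens, two-dimensional vectors and gradients.
vectors = [[0.3, -0.1], [1.0, -0.5], [0.2, 0.2], [0.8, 0.4]]
grads = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.0], [1.0, 0.5]]
message = compress(vectors, grads)
```

In this toy run the two tokens with the largest gradient norms (indices 1 and 3) are transmitted, the first at 8 bits and the second at 4 bits, and the other two are dropped. Because the message is keyed by token index, a server could in principle reconstruct a dense update without changing its aggregation rule, which is the property the abstract emphasizes; how Fed-FSTQ actually packs its messages is not specified in this text.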
