2603.26786v1 Mar 25, 2026 cs.LG

A Step Toward Federated Pretraining of Multimodal Large Language Models

Xiaoshan Yang

Changsheng Xu

Yaguang Song

Baochen Xiong

Yifan Xu

Yaowei Wang

Original Abstract

The rapid evolution of Multimodal Large Language Models (MLLMs) is bottlenecked by the saturation of high-quality public data, while vast amounts of diverse multimodal data remain inaccessible in privacy-sensitive silos. Federated Learning (FL) offers a promising solution to unlock these distributed resources, but existing research focuses predominantly on fine-tuning, leaving the foundational pre-training phase largely unexplored. In this paper, we formally introduce the Federated MLLM Alignment (Fed-MA) task, a lightweight pre-training paradigm that freezes the vision encoder and LLM while collaboratively training the cross-modal projector. We identify two critical challenges in this setting: (i) parameter interference in aggregating local projectors; and (ii) gradient oscillations in one-pass collaborative SGD. To address these challenges, we propose Fed-CMP, a pioneering framework for federated MLLM pre-training. Fed-CMP employs Canonical Reliability-Aware Aggregation, which constructs a canonical space to decompose client projectors into a shared alignment basis and client-specific coefficients, then performs reliability-weighted fusion to suppress parameter interference. Furthermore, Fed-CMP introduces Orthogonality-Preserved Momentum, which applies momentum to the shared alignment basis via orthogonal projection, accumulating historical optimization directions while preserving geometric structure. We construct four federated pre-training scenarios based on public datasets, and extensive experiments validate that Fed-CMP significantly outperforms existing baselines.
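The abstract only names the two components, so the following is a minimal NumPy sketch of the general idea rather than the Fed-CMP algorithm itself. Everything specific here is an assumption made for illustration: the shared basis is taken from an SVD of the stacked client projectors, a client's "reliability" is the fraction of its energy captured by that basis, and orthogonality is restored with a polar (SVD-based) projection. The function names `canonical_aggregate` and `momentum_basis_update` are hypothetical.

```python
import numpy as np


def canonical_aggregate(projectors, rank):
    """Fuse client projector matrices in a shared low-rank basis.

    Assumptions (not taken from the paper): the shared basis comes from an
    SVD of the horizontally stacked client matrices, and reliability is the
    share of each client's energy that the basis captures.
    """
    stacked = np.concatenate(projectors, axis=1)       # (d_out, K * d_in)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    basis = u[:, :rank]                                # orthonormal columns
    # Client-specific coefficients expressed in the shared basis.
    coeffs = [basis.T @ w for w in projectors]
    energy = np.array([
        np.linalg.norm(c) ** 2 / np.linalg.norm(w) ** 2
        for c, w in zip(coeffs, projectors)
    ])
    weights = energy / energy.sum()                    # non-negative, sums to 1
    fused = sum(wt * c for wt, c in zip(weights, coeffs))
    return basis, fused, weights


def momentum_basis_update(prev_basis, new_basis, velocity, beta=0.9):
    """Apply momentum to the basis, then project back to orthonormal columns.

    The projection is the polar factor U @ Vt of the SVD, i.e. the nearest
    matrix with orthonormal columns in Frobenius norm. This sketch ignores
    the sign and rotation ambiguity between bases from different rounds.
    """
    velocity = beta * velocity + (new_basis - prev_basis)
    u, _, vt = np.linalg.svd(prev_basis + velocity, full_matrices=False)
    return u @ vt, velocity


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clients = [rng.normal(size=(8, 6)) for _ in range(3)]
    basis, fused, weights = canonical_aggregate(clients, rank=4)

    # A second round with slightly perturbed client projectors.
    clients2 = [w + 0.1 * rng.normal(size=w.shape) for w in clients]
    basis2, _, _ = canonical_aggregate(clients2, rank=4)
    updated, velocity = momentum_basis_update(basis, basis2, np.zeros_like(basis))

    print("reliability weights:", np.round(weights, 3))
    print("columns orthonormal:", np.allclose(updated.T @ updated, np.eye(4)))
```

In this sketch the global projector for a round would be `basis @ fused`. A fuller treatment would align successive bases before accumulating momentum and would transform the coefficients to match whenever the basis changes.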
