2605.03426v1 May 05, 2026 cs.AI

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

Xiaoshan Yang, Hongwei Zheng, Shule Lu, Yujing Wang, Hainan Zhang, Yongxin Tong, Changsheng Xu, Zhiming Zheng

Abstract

Vision-Language Models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints render centralized training infeasible. Federated Learning mitigates this issue by enabling decentralized training, but practical deployments face challenges due to client heterogeneity in computational resources, application requirements, and model architectures. Under extreme model and data heterogeneity, replacing parameter aggregation with preference-based collaboration offers a more suitable interface, as it eliminates the need for direct parameter or data exchange. Motivated by this, we propose MoR, a federated alignment framework that combines GRPO with Mixture-of-Rewards for heterogeneous VLMs. In MoR, each client locally trains a reward model from local preference annotations, capturing specific evaluation signals without exposing raw data. To combine these heterogeneous supervision signals, MoR introduces a Mixture-of-Rewards mechanism with learned routing, which adaptively fuses client reward models according to the input and alignment objective. The server then optimizes a base VLM using GRPO with a KL penalty to a reference model, enabling preference alignment without requiring client models to share architectures or parameters. Experiments on diverse public vision-language benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
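
The abstract names three ingredients: client reward models trained on local preference pairs, a routed mixture of those rewards, and server-side GRPO with a KL penalty to a reference model. The Python sketch below illustrates one plausible reading of each step on scalar toy values. It is not the authors' implementation: the abstract gives no loss, router architecture, or hyperparameters, so the Bradley-Terry pairwise loss, the softmax router, the unclipped surrogate, the value of `beta`, and every function name here are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Assumed client-side objective: -log sigmoid(r_chosen - r_rejected).

    Computed as a stable softplus of the negative margin.
    """
    margin = r_chosen - r_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def mixture_of_rewards(client_rewards: Sequence[float],
                       router_logits: Sequence[float]) -> float:
    """Assumed fusion rule: softmax-weighted sum of client reward scores.

    In a real system the router logits would come from a learned network
    conditioned on the input; here they are plain numbers.
    """
    weights = softmax(router_logits)
    return sum(w * r for w, r in zip(weights, client_rewards))


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def kl_estimate(logp_policy: float, logp_ref: float) -> float:
    """Non-negative per-sample KL estimator: exp(d) - d - 1, d = logp_ref - logp_policy."""
    d = logp_ref - logp_policy
    return math.exp(d) - d - 1.0


def simplified_grpo_objective(rewards: Sequence[float],
                              logp_policy: Sequence[float],
                              logp_ref: Sequence[float],
                              beta: float = 0.1) -> float:
    """Simplified surrogate to maximize (ratio clipping omitted for brevity):
    mean_i [ A_i * logp_policy_i - beta * KL_i ].
    """
    adv = group_relative_advantages(rewards)
    terms = [a * lp - beta * kl_estimate(lp, lr)
             for a, lp, lr in zip(adv, logp_policy, logp_ref)]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Toy group of three sampled responses, each scored by two client reward models.
    per_response_client_scores = [[0.2, 0.9], [0.5, 0.1], [0.8, 0.7]]
    router_logits = [0.0, 1.0]  # hypothetical router output for this input
    fused = [mixture_of_rewards(s, router_logits) for s in per_response_client_scores]
    objective = simplified_grpo_objective(
        fused,
        logp_policy=[-1.2, -0.8, -1.0],
        logp_ref=[-1.1, -0.9, -1.0],
    )
    print("fused rewards:", fused)
    print("surrogate objective:", objective)
```

In this reading only scalar reward scores cross the client boundary, which is consistent with the abstract's claim that clients share neither parameters nor raw data. How the router itself is trained is not described in the abstract and is left out here.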
