2602.00485v1 Jan 31, 2026 cs.AI

Replacing Parameters with Preferences: Federated Alignment of Heterogeneous Vision-Language Models

Xiaoshan Yang (Citations: 1,959; h-index: 24)
Hongwei Zheng (Citations: 445; h-index: 10)
Shule Lu (Citations: 0; h-index: 0)
Yujing Wang (Citations: 20; h-index: 2)
Hainan Zhang (Citations: 72; h-index: 4)
Yongxin Tong (Citations: 6; h-index: 1)
Changsheng Xu (Citations: 217; h-index: 7)
Zhiming Zheng (Citations: 75; h-index: 5)

Abstract

Vision-language models (VLMs) have broad potential in privacy-sensitive domains such as healthcare and finance, yet strict data-sharing constraints make centralized training infeasible. Federated learning (FL) mitigates this issue by enabling decentralized training, but practical deployments face challenges from client heterogeneity in computational resources, application requirements, and model architectures. We argue that while replacing data with model parameters characterizes present-day FL, replacing parameters with preferences represents a more scalable and privacy-preserving future. Motivated by this perspective, we propose MoR, a federated alignment framework for heterogeneous VLMs based on Group Relative Policy Optimization (GRPO) with Mixture-of-Rewards. MoR initializes a visual foundation model as a KL-regularized reference, while each client trains its own reward model on local preference annotations, capturing client-specific evaluation signals without exposing raw data. To reconcile heterogeneous rewards, we introduce a routing-based fusion mechanism that adaptively aggregates client reward signals. Finally, the server performs GRPO with this mixed reward to optimize the base VLM. Experiments on three public VQA benchmarks demonstrate that MoR consistently outperforms federated alignment baselines in generalization, robustness, and cross-client adaptability. Our approach provides a scalable solution for privacy-preserving alignment of heterogeneous VLMs under federated settings.
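The abstract names two computational steps: fusing per-client reward signals through a router, and running GRPO on the fused reward. The sketch below illustrates one plausible reading of those steps, not the authors' implementation. It assumes the router emits one logit per client and that fusion is a softmax-weighted sum; the function names (`mix_rewards`, `group_relative_advantages`) and the `router_logits` input are hypothetical. The advantage computation follows the standard GRPO convention of normalizing rewards within a group of responses sampled for the same prompt.

```python
import math
from statistics import mean, pstdev


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_rewards(client_rewards, router_logits):
    """Fuse per-client rewards into one reward per candidate response.

    client_rewards[k][i] is the score that client k's reward model gives
    to candidate response i. router_logits[k] is a hypothetical routing
    score for client k (assumed here; the abstract does not specify it).
    """
    weights = softmax(router_logits)
    n_candidates = len(client_rewards[0])
    return [
        sum(w * rewards[i] for w, rewards in zip(weights, client_rewards))
        for i in range(n_candidates)
    ]


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Two clients score three candidate responses to the same prompt.
    client_rewards = [[0.2, 0.9, 0.4], [0.1, 0.8, 0.6]]
    router_logits = [0.0, 0.0]  # equal routing weight for both clients
    mixed = mix_rewards(client_rewards, router_logits)
    print(mixed)
    print(group_relative_advantages(mixed))
```

Under this reading, only reward scores and routing weights reach the server, so clients with different architectures never need to exchange or average model parameters.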

