2601.18731v1 Jan 26, 2026 cs.CL

One Adapts to Any: Meta Reward Modeling for Personalized LLM Alignment

Yongqi Li

Hongru Cai

Wenjie Li

Fuli Feng

Tiezheng Yu

Wenjie Wang

Fengbin Zhu

National University of Singapore

Original Abstract

Alignment of Large Language Models (LLMs) aims to align outputs with human preferences, and personalized alignment further adapts models to individual users. This relies on personalized reward models that capture user-specific preferences and automatically provide individualized feedback. However, developing these models faces two critical challenges: the scarcity of feedback from individual users and the need for efficient adaptation to unseen users. We argue that addressing these constraints requires a paradigm shift from fitting data to learn user preferences to learning the process of preference adaptation. To realize this, we propose Meta Reward Modeling (MRM), which reformulates personalized reward modeling as a meta-learning problem. Specifically, we represent each user's reward model as a weighted combination of base reward functions, and optimize the initialization of these weights using a Model-Agnostic Meta-Learning (MAML)-style framework to support fast adaptation under limited feedback. To ensure robustness, we introduce the Robust Personalization Objective (RPO), which places greater emphasis on hard-to-learn users during meta optimization. Extensive experiments on personalized preference datasets validate that MRM enhances few-shot personalization, improves user robustness, and consistently outperforms baselines.
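
The abstract describes the mechanism only in words, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes each response is summarized by a vector of K base reward scores, that preferences follow a Bradley-Terry model, that the outer loop uses a first-order approximation of MAML, and that "greater emphasis on hard-to-learn users" can be stood in for by a softmax weighting over per-user query losses. None of these choices, nor any of the function names, hyperparameters, or synthetic users below, come from the paper.

```python
import math
import random


def bt_loss_and_grad(w, pairs):
    """Mean Bradley-Terry loss and gradient for one user (assumed loss form).

    Each pair is d = phi(chosen) - phi(rejected), where phi holds the K base
    reward scores; the user's reward is the weighted sum w . phi.
    """
    loss, grad = 0.0, [0.0] * len(w)
    for d in pairs:
        margin = sum(wi * di for wi, di in zip(w, d))
        # -log sigmoid(margin), written in a numerically stable form
        loss += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
        s = 0.5 * (1.0 - math.tanh(margin / 2.0))  # equals sigmoid(-margin)
        for i, di in enumerate(d):
            grad[i] -= s * di
    n = len(pairs)
    return loss / n, [g / n for g in grad]


def adapt(w0, support, lr=0.5, steps=3):
    """Inner loop: a few gradient steps from the shared initialization."""
    w = list(w0)
    for _ in range(steps):
        _, g = bt_loss_and_grad(w, support)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


def robust_weights(losses, tau=1.0):
    """Softmax over per-user losses: harder users get larger weight.

    Hypothetical stand-in for the paper's RPO, whose exact form is not
    given in the abstract.
    """
    m = max(losses)
    exps = [math.exp((x - m) / tau) for x in losses]
    z = sum(exps)
    return [e / z for e in exps]


def meta_train(users, k, meta_lr=0.5, epochs=50, tau=1.0):
    """Outer loop, first-order approximation: update the initialization w0.

    Query-loss gradients at the adapted weights are applied directly to w0,
    and the robust weights are treated as constants in the update.
    """
    w0 = [0.0] * k
    for _ in range(epochs):
        losses, grads = [], []
        for support, query in users:
            w_u = adapt(w0, support)
            loss, g = bt_loss_and_grad(w_u, query)
            losses.append(loss)
            grads.append(g)
        weights = robust_weights(losses, tau)
        for i in range(k):
            w0[i] -= meta_lr * sum(a * g[i] for a, g in zip(weights, grads))
    return w0


def make_user(true_w, n, rng):
    """Synthetic user who prefers the response with the higher true reward."""
    pairs = []
    for _ in range(n):
        a = [rng.random() for _ in true_w]
        b = [rng.random() for _ in true_w]
        ra = sum(t * x for t, x in zip(true_w, a))
        rb = sum(t * x for t, x in zip(true_w, b))
        chosen, rejected = (a, b) if ra >= rb else (b, a)
        pairs.append([c - r for c, r in zip(chosen, rejected)])
    return pairs[: n // 2], pairs[n // 2:]  # (support, query)


rng = random.Random(0)
users = [make_user(tw, 8, rng)
         for tw in ([1.0, 0.2, 0.0], [0.8, 0.0, 0.3], [0.1, 1.0, 0.0])]
w0 = meta_train(users, k=3)
# Few-shot adaptation to an unseen user from four support pairs.
w_new = adapt(w0, make_user([0.9, 0.1, 0.1], 8, rng)[0])
```

In this sketch only the K mixing weights are adapted per user, which is what makes adaptation from a handful of preference pairs plausible. The temperature `tau` controls how strongly the outer update concentrates on users with high post-adaptation loss; a very large `tau` weights all users almost equally.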
