2603.29871v1 Mar 31, 2026 cs.AI

ShapE-GRPO: Shapley-Enhanced Reward Allocation for Multi-Candidate LLM Training

Rui Ai

Yunqing Pan

David Simchi-Levi

Chonghuan Wang

Original Abstract

In user-agent interaction scenarios such as recommendation, brainstorming, and code suggestion, Large Language Models (LLMs) often generate sets of candidate recommendations where the objective is to maximize the collective utility of the entire set rather than individual candidates independently. However, existing reinforcement learning post-training paradigms, such as Group Relative Policy Optimization (GRPO), typically assign the same set-level scalar reward to every candidate in the set. This leads to noisy training signals where poor candidates free-ride on the high reward produced by a single strong peer, resulting in suboptimal exploration. To address this, we propose Shapley-Enhanced GRPO (ShapE-GRPO). By leveraging the permutation-invariant nature of set-level utility, we derive a Shapley-enhanced formulation from cooperative game theory to decompose set-level rewards into granular, candidate-specific signals. We show that our formulation preserves the fundamental axioms of the Shapley value while remaining computationally efficient with polynomial-time complexity. Empirically, ShapE-GRPO consistently outperforms standard GRPO across diverse datasets with accelerated convergence during training.
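To make the free-riding problem concrete, here is a minimal Python sketch of how a set-level reward could be split into per-candidate Shapley credits. It is an illustration under stated assumptions, not the paper's formulation: the set utility is assumed to be the maximum of per-candidate scores (with nonnegative scores and an empty-set utility of 0), and the names `shapley_exact`, `max_utility`, and `shapley_max_closed_form` are hypothetical. The first function averages marginal gains over all orderings (exponential cost); the second uses the well-known closed form for max-type games, which runs in O(n log n) and hints at why a polynomial-time allocation is possible for structured utilities.

```python
from __future__ import annotations

from itertools import permutations
from typing import Callable, Sequence


def max_utility(chosen: Sequence[float]) -> float:
    # Assumed set-level utility: the best candidate's score (0 for the empty set).
    return max(chosen, default=0.0)


def shapley_exact(
    scores: Sequence[float],
    utility: Callable[[Sequence[float]], float],
) -> list[float]:
    """Exact Shapley values: average marginal gain over all n! orderings."""
    n = len(scores)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        chosen: list[float] = []
        prev = utility(chosen)
        for i in order:
            chosen.append(scores[i])
            cur = utility(chosen)
            phi[i] += cur - prev  # marginal contribution of candidate i
            prev = cur
    return [p / len(orderings) for p in phi]


def shapley_max_closed_form(scores: Sequence[float]) -> list[float]:
    """Closed form for u(S) = max(S) with nonnegative scores.

    Each score increment between consecutive sorted scores is shared
    equally among the candidates whose score reaches that increment.
    """
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i])
    phi = [0.0] * n
    running = 0.0
    prev = 0.0
    for rank, i in enumerate(order):
        running += (scores[i] - prev) / (n - rank)
        phi[i] = running
        prev = scores[i]
    return phi


if __name__ == "__main__":
    scores = [0.2, 0.9, 0.5]  # hypothetical per-candidate scores
    print(shapley_exact(scores, max_utility))
    print(shapley_max_closed_form(scores))
```

With the hypothetical scores `[0.2, 0.9, 0.5]`, a shared set-level reward would give every candidate 0.9. The Shapley split instead assigns roughly 0.067, 0.617, and 0.217, which still sum to the set reward of 0.9 but credit the strong candidate far more than its weaker peers.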
