2602.21765v1 Feb 25, 2026 cs.LG

Generalisation of RLHF under Reward Shift and Clipped KL Regularisation

Yuzhu Chen (citations: 22, h-index: 3)

Fengxiang He (citations: 18, h-index: 3)

Ke Tang (citations: 292, h-index: 7)

Original Abstract

Alignment and adaptation in large language models heavily rely on reinforcement learning from human feedback (RLHF); yet, theoretical understanding of its generalisability remains premature, especially when the learned reward could shift, and the KL control is estimated and clipped. To address this issue, we develop generalisation theory for RLHF that explicitly accounts for (1) reward shift: reward models are trained on preference data from earlier or mixed behaviour policies while RLHF optimises the current policy on its own rollouts; and (2) clipped KL regularisation: the KL regulariser is estimated from sampled log-probability ratios and then clipped for stabilisation, resulting in an error to RLHF. We present generalisation bounds for RLHF, suggesting that the generalisation error stems from a sampling error from prompts and rollouts, a reward shift error, and a KL clipping error. We also discuss special cases of (1) initialising RLHF parameters with a uniform prior over a finite space, and (2) training RLHF by stochastic gradient descent, as an Ornstein-Uhlenbeck process. The theory yields practical implications in (1) optimal KL clipping threshold, and (2) budget allocation in prompts, rollouts, and preference data.
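
To make the "estimated and clipped" KL term concrete, here is a minimal Python sketch of a Monte Carlo KL estimate built from sampled log-probability ratios. The abstract does not give the exact form of the clipping, so the symmetric clip to [-tau, tau] on each per-sample log ratio, the function name `clipped_kl_estimate`, and the parameter `tau` are all assumptions made for illustration, not the authors' definition.

```python
import math
import random


def clipped_kl_estimate(logp_policy, logp_ref, tau=None):
    """Average of per-sample log ratios log pi(y|x) - log pi_ref(y|x).

    Samples are assumed to be drawn from the current policy pi.
    If tau is given, each log ratio is clipped to [-tau, tau] before
    averaging (an assumed clipping rule, chosen for illustration).
    """
    if len(logp_policy) != len(logp_ref) or not logp_policy:
        raise ValueError("need two non-empty sequences of equal length")
    if tau is not None and tau <= 0:
        raise ValueError("tau must be positive")

    ratios = [p - q for p, q in zip(logp_policy, logp_ref)]
    if tau is not None:
        ratios = [max(-tau, min(tau, r)) for r in ratios]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy categorical "policy" and "reference" over three responses
    # (hypothetical numbers, not taken from the paper).
    policy = [0.7, 0.2, 0.1]
    ref = [0.4, 0.3, 0.3]
    exact_kl = sum(p * math.log(p / q) for p, q in zip(policy, ref))

    rng = random.Random(0)
    samples = rng.choices(range(3), weights=policy, k=10_000)
    logp_policy = [math.log(policy[i]) for i in samples]
    logp_ref = [math.log(ref[i]) for i in samples]

    print("exact KL:        ", exact_kl)
    print("unclipped estimate:", clipped_kl_estimate(logp_policy, logp_ref))
    print("clipped (tau=0.5): ", clipped_kl_estimate(logp_policy, logp_ref, tau=0.5))
```

In this toy setting, the gap between the unclipped estimate and the exact KL corresponds to the sampling error, and the additional gap introduced by a small `tau` corresponds to the clipping error that the abstract names as separate terms in the bound.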
