2602.07799v1 Feb 08, 2026 cs.LG

Fairness Aware Reward Optimization

Ching Lam Choi (Citations: 133, h-index: 2)

Vighnesh Subramaniam (Citations: 145, h-index: 4)

Phillip Isola (Citations: 3, h-index: 1)

Antonio Torralba (Citations: 73, h-index: 4)

Stefanie Jegelka (Citations: 217, h-index: 6)

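Demographic biases in human preference data introduce systematic unfairness into LLMs (large language models) aligned through reward models. This paper proposes Fairness Aware Reward Optimization (Faro), a framework that applies demographic parity, equalized odds, or counterfactual fairness constraints when training the reward model. It provides the first theoretical analysis of reward-level fairness in LLM alignment, with three results: (i) provable fairness certificates with controllable slack for Faro-trained rewards; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL regularization, which proves that fairness transfers from the reward to the policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre-processing and post-processing methods, Faro ensures that the reward model is simultaneously ordinal (ranks correctly), cardinal (calibrated), and fair. In experiments across a range of LLMs and benchmarks, Faro substantially reduces bias and harmful content generation while maintaining or improving model quality.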

Original Abstract

Demographic skews in human preference data propagate systematic unfairness through reward models into aligned LLMs. We introduce Fairness Aware Reward Optimization (Faro), an in-processing framework that trains reward models under demographic parity, equalized odds, or counterfactual fairness constraints. We provide the first theoretical analysis of reward-level fairness in LLM alignment, establishing: (i) provable fairness certificates for Faro-trained rewards with controllable slack; (ii) a formal characterization of the accuracy-fairness trade-off induced by KL-regularized fine-tuning, proving fairness transfers from reward to policy; and (iii) the existence of a non-empty Pareto frontier. Unlike pre- and post-processing methods, Faro ensures reward models are simultaneously ordinal (ranking correctly), cardinal (calibrated), and fair. Across multiple LLMs and benchmarks, Faro significantly reduces bias and harmful generations while maintaining or improving model quality.
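
To make the idea of in-processing fairness constraints concrete, the following is a minimal sketch of a pairwise reward-model loss with a demographic parity penalty and a slack term. It is not the paper's algorithm: the abstract does not specify the loss, so the Bradley-Terry objective, the hinge-style penalty, the definition of the parity gap (difference in mean predicted preference probability between two groups), and all names (`penalized_reward_loss`, `lam`, `slack`) are assumptions chosen for illustration.

```python
import math
from typing import Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def bradley_terry_loss(r_chosen: Sequence[float], r_rejected: Sequence[float]) -> float:
    """Mean of -log sigmoid(r_chosen - r_rejected) over preference pairs."""
    margins = [c - r for c, r in zip(r_chosen, r_rejected)]
    # -log sigmoid(m) equals softplus(-m), written here in a stable form.
    losses = [max(-m, 0.0) + math.log1p(math.exp(-abs(m))) for m in margins]
    return sum(losses) / len(losses)


def demographic_parity_gap(
    r_chosen: Sequence[float], r_rejected: Sequence[float], groups: Sequence[int]
) -> float:
    """Assumed parity gap: |mean P(chosen wins | group 0) - mean P(chosen wins | group 1)|."""
    probs = {0: [], 1: []}
    for c, r, g in zip(r_chosen, r_rejected, groups):
        probs[g].append(sigmoid(c - r))
    if not probs[0] or not probs[1]:
        raise ValueError("both groups need at least one preference pair")
    mean0 = sum(probs[0]) / len(probs[0])
    mean1 = sum(probs[1]) / len(probs[1])
    return abs(mean0 - mean1)


def penalized_reward_loss(
    r_chosen: Sequence[float],
    r_rejected: Sequence[float],
    groups: Sequence[int],
    lam: float = 1.0,    # hypothetical penalty weight
    slack: float = 0.05,  # hypothetical tolerated parity gap
) -> float:
    """Ranking loss plus a hinge penalty on the parity gap beyond the slack."""
    gap = demographic_parity_gap(r_chosen, r_rejected, groups)
    return bradley_terry_loss(r_chosen, r_rejected) + lam * max(0.0, gap - slack)


# Toy rewards: group 0 pairs receive larger margins than group 1 pairs.
r_chosen = [2.0, 1.0, 0.5, 0.0]
r_rejected = [0.0, 0.0, 0.0, 0.0]
groups = [0, 0, 1, 1]
gap = demographic_parity_gap(r_chosen, r_rejected, groups)
loss = penalized_reward_loss(r_chosen, r_rejected, groups, lam=1.0, slack=0.05)
```

In this sketch, raising `slack` loosens the constraint until the penalty vanishes, and `lam` sets how strongly ranking accuracy is traded against the parity gap. Equalized odds or counterfactual fairness would need a different gap function.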

