2604.10585v1 Apr 12, 2026 cs.LG

Calibration Errors Under Sycophancy-Inducing Fine-Tuning: How Reward Hacking Affects Uncertainty Estimation in LLMs

Calibration Collapse Under Sycophancy Fine-Tuning: How Reward Hacking Breaks Uncertainty Quantification in LLMs

Citations: 6

h-index: 2

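Recent large language models (LLMs) are often fine-tuned with reinforcement learning from human feedback (RLHF) or related reward-optimization methods. These procedures improve a model's helpfulness, but this study investigates whether sycophancy-inducing reward signals degrade calibration. Qwen3-8B is compared under three regimes: first, the base model with no fine-tuning; second, neutral supervised fine-tuning (SFT) on the TriviaQA dataset; and third, sycophancy-inducing Group Relative Policy Optimization (GRPO) that rewards agreement with deliberately planted wrong answers. Evaluated on 1,000 MMLU questions across five subject domains, with bootstrap confidence intervals and permutation tests, sycophancy-inducing GRPO produces a consistent directional degradation in calibration: expected calibration error (ECE) rises by +0.006 relative to the base model, and maximum calibration error (MCE) rises by +0.010 relative to neutral SFT. The effect is not statistically significant at the current training budget (p = 0.41). Post-hoc matrix scaling applied to all three models reduces ECE by 40-64% and improves accuracy by 1.5-3.0 percentage points. Even so, the sycophancy-induced model still shows the highest post-scaling ECE, above that of the neutral SFT model (0.042 vs. 0.037), which suggests that reward-induced miscalibration leaves a structured residual even after affine correction. The study presents a methodology for evaluating how reward hacking affects calibration and calls for calibration-aware training objectives.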
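The two calibration metrics named in the summary can be illustrated with a minimal sketch. It assumes the common equal-width binning estimator with 10 bins; the summary does not state the binning scheme or bin count used in the paper, and the function name `calibration_errors` is hypothetical.

```python
import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) with equal-width confidence bins.

    Assumption: 10 equal-width bins over [0, 1]; the paper's exact
    binning is not given in the abstract.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids run from 0 to n_bins - 1 and each
    # bin covers the half-open interval (lower, upper].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece, mce = 0.0, 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        # Gap between empirical accuracy and mean confidence in this bin.
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
        mce = max(mce, gap)       # worst-case bin
    return ece, mce

# Toy example: four answers at confidence 0.85 (three correct) and
# two answers at confidence 0.55 (one correct).
ece, mce = calibration_errors(
    [0.85, 0.85, 0.85, 0.85, 0.55, 0.55],
    [1, 1, 1, 0, 1, 0],
)
print(f"ECE={ece:.3f}, MCE={mce:.3f}")  # ECE=0.083, MCE=0.100
```

In the toy example the high-confidence bin has a gap of 0.10 and holds 4 of 6 samples, and the other bin has a gap of 0.05, so ECE is (4/6)(0.10) + (2/6)(0.05) ≈ 0.083, while MCE is the larger gap, 0.10.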

Original Abstract

Modern large language models (LLMs) are increasingly fine-tuned via reinforcement learning from human feedback (RLHF) or related reward optimisation schemes. While such procedures improve perceived helpfulness, we investigate whether sycophantic reward signals degrade calibration -- a property essential for reliable uncertainty quantification. We fine-tune Qwen3-8B under three regimes: no fine-tuning (base), neutral supervised fine-tuning (SFT) on TriviaQA, and sycophancy-inducing Group Relative Policy Optimisation (GRPO) that rewards agreement with planted wrong answers. Evaluating on $1{,}000$ MMLU items across five subject domains with bootstrap confidence intervals and permutation testing, we find that \textbf{sycophantic GRPO produces consistent directional calibration degradation} -- ECE rises by $+0.006$ relative to the base model and MCE increases by $+0.010$ relative to neutral SFT -- though the effect does not reach statistical significance ($p = 0.41$) at this training budget. Post-hoc matrix scaling applied to all three models reduces ECE by $40$--$64\%$ and improves accuracy by $1.5$--$3.0$ percentage points. However, the sycophantic model retains the highest post-scaling ECE relative to the neutral SFT control ($0.042$ vs.\ $0.037$), suggesting that reward-induced miscalibration leaves a structured residual even after affine correction. These findings establish a methodology for evaluating the calibration impact of reward hacking and motivate calibration-aware training objectives.
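The "affine correction" mentioned in the abstract refers to matrix scaling, which maps logits z to Wz + b and fits W and b by minimising negative log-likelihood on held-out data. The following is a rough sketch of that idea only: the optimiser (plain gradient descent), learning rate, step count, identity initialisation, and the function names are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_matrix_scaling(logits, labels, lr=0.05, steps=500):
    """Fit W (K x K) and b (K) so that softmax(logits @ W.T + b) has low NLL.

    Assumptions: full-batch gradient descent, identity initialisation,
    hypothetical hyperparameters.
    """
    n, k = logits.shape
    W = np.eye(k)      # start from the uncalibrated model
    b = np.zeros(k)
    onehot = np.eye(k)[labels]
    for _ in range(steps):
        probs = softmax(logits @ W.T + b)
        grad_out = (probs - onehot) / n   # gradient of NLL w.r.t. scaled logits
        W -= lr * (grad_out.T @ logits)   # chain rule through logits @ W.T
        b -= lr * grad_out.sum(axis=0)
    return W, b

def apply_matrix_scaling(logits, W, b):
    return softmax(logits @ W.T + b)
```

A typical use would fit `W` and `b` on a held-out calibration split and then report ECE on a separate test split, since fitting and evaluating on the same items would overstate the improvement.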

2 Citations

0 Influential

1 Altmetric

7.0 Score
