2603.08412v1 Mar 09, 2026 cs.CL

Aligning to Illusions: Choice Blindness in Human and AI Feedback

Original Abstract

Reinforcement Learning from Human Feedback (RLHF) assumes annotator preferences reflect stable internal states. We challenge this through three experiments spanning the preference pipeline. In a human choice blindness study, 91% of surreptitiously swapped preferences go undetected, extending choice blindness to third-person evaluative comparison of unfamiliar text. Testing fifteen LLM judges as potential replacements, we find detection relies on shallow text matching rather than genuine self-monitoring: removing prior reasoning from context causes blindness to surge from near-zero to over 50%, while explicit social pressure induces near-universal compliance. In a dose-response experiment across two architectures from 86M to 2B parameters, one-sixth to one-third of labels must be corrupted before the reward signal halves, yet standard pairwise accuracy remains virtually unchanged. A Best-of-N evaluation confirms this translates to downstream policy degradation: at 50% corruption, reward-guided selection produces no improvement over random sampling, while the proxy model reports monotonically increasing scores. Together, these results reveal a preference construction problem: the signal entering RLHF is shaped by elicitation context in ways that neither human metacognition, LLM self-monitoring, nor standard evaluation metrics can detect.
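The two manipulations the abstract describes, corrupting a fraction of preference labels and selecting the Best-of-N candidate under a proxy reward, can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes binary preference labels (1 if the first response is preferred, 0 otherwise), and the names `corrupt_labels`, `best_of_n`, and `reward_fn` are hypothetical.

```python
import random


def corrupt_labels(labels, fraction, seed=0):
    """Flip a fixed fraction of binary preference labels (assumed 0/1)."""
    rng = random.Random(seed)
    n_flip = round(fraction * len(labels))
    # Choose distinct positions to flip so exactly n_flip labels change.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def best_of_n(candidates, reward_fn):
    """Return the candidate that the (hypothetical) proxy reward scores highest."""
    return max(candidates, key=reward_fn)


if __name__ == "__main__":
    clean = [1] * 12
    noisy = corrupt_labels(clean, fraction=1 / 6)
    print(sum(a != b for a, b in zip(clean, noisy)), "labels flipped")

    # Toy proxy reward for illustration only: prefer longer strings.
    print(best_of_n(["a", "bbb", "cc"], reward_fn=len))
```

In a dose-response setup of the kind the abstract mentions, `fraction` would be swept over a range of values and a reward model retrained on each corrupted label set; that training loop is outside the scope of this sketch.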
