2601.17329v1 Jan 24, 2026 cs.LG

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Tiejin Chen (citations: 287, h-index: 8)

Xiaoou Liu (citations: 152, h-index: 4)

Vishnu Nandam (citations: 23, h-index: 2)

Kuan-Ru Liou (citations: 60, h-index: 2)

Hua Wei (citations: 52, h-index: 3)

Abstract

Preference-based alignment methods such as Reinforcement Learning from Human Feedback (RLHF) learn from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences but ignore a more fundamental factor: the reliability of the answers being compared. To address this problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting. Code is provided by the authors.
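The abstract does not spell out how answer-level reliability is computed or how it enters the loss, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes each answer has a scalar nonconformity score (higher means less trustworthy), uses a standard split-conformal p-value as the reliability of an answer, aggregates the two answers of a preference pair with a `min` rule, and applies the result as a per-pair weight on a DPO-style loss. The function names, the `min` aggregation, and the weight normalization are all hypothetical choices made for illustration.

```python
import numpy as np


def conformal_p_value(cal_scores, score):
    """Split-conformal p-value of one answer's nonconformity score.

    cal_scores: nonconformity scores from a held-out calibration set.
    The answer belongs to the prediction set at miscoverage level alpha
    exactly when this p-value is greater than alpha.
    """
    cal_scores = np.asarray(cal_scores, dtype=float)
    n_at_least = np.sum(cal_scores >= score)
    return float((1.0 + n_at_least) / (len(cal_scores) + 1.0))


def in_prediction_set(cal_scores, score, alpha):
    """True if the answer is kept at target coverage 1 - alpha."""
    return conformal_p_value(cal_scores, score) > alpha


def pair_weight(p_chosen, p_rejected):
    """Assumed aggregation: a pair is only as reliable as its weaker answer."""
    return min(p_chosen, p_rejected)


def weighted_dpo_loss(margins, weights, beta=0.1):
    """Weighted DPO-style loss.

    margins[i] is the difference of policy-vs-reference log-ratios between
    the chosen and the rejected answer of pair i. Each pair contributes
    -log(sigmoid(beta * margin)), scaled by its weight; the total is
    normalized by the sum of the weights (assumed to be positive).
    """
    margins = np.asarray(margins, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # -log(sigmoid(x)) == log(1 + exp(-x)), computed stably via logaddexp.
    per_pair = np.logaddexp(0.0, -beta * margins)
    return float(np.sum(weights * per_pair) / np.sum(weights))
```

With this choice, the miscoverage level `alpha` acts as the knob for "controllable coverage": lowering it admits more answers into the prediction set. Using the p-value as a soft weight, rather than a hard in-set/out-of-set decision, is one possible design; a thresholded variant would simply drop pairs whose answers fall outside the set.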

Citations: 2, Influential: 0, Altmetric: 4, Score: 22.0
