2604.17112v1 Apr 18, 2026 cs.AI

Complementing Self-Consistency with Cross-Model Disagreement for Uncertainty Quantification

Marzyeh Ghassemi (Citations: 97; h-index: 5)
Kimia Hamidieh (Citations: 230; h-index: 6)
Veronika Thost (Citations: 37; h-index: 3)
Walter Gerych (Citations: 100; h-index: 4)
Mikhail Yurochkin (Citations: 61; h-index: 2)

Large language models (LLMs) often give confident answers that turn out to be wrong, and uncertainty quantification is one potential way to address this and make the models more usable. Recent work mostly estimates aleatoric uncertainty (AU) with self-consistency, but this approach breaks down when a model is overconfident and produces the same wrong answer across samples. This study analyzes that regime and shows that, when AU is low, cross-model semantic disagreement is higher on incorrect answers. Building on this, it introduces an epistemic uncertainty (EU) term that works in the black-box setting. EU uses only text generated by a small, scale-matched ensemble of models and is computed as the gap between inter-model and intra-model sequence-semantic similarity. Total uncertainty (TU) is then defined as the sum of AU and EU. In a comprehensive study of five instruction-tuned models with 7-9B parameters on ten long-form generation tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures even where AU is low. The study also uses agreement and complementarity diagnostics to characterize when EU is most useful.

Original Abstract

Large language models (LLMs) often produce confident yet incorrect responses, and uncertainty quantification is one potential solution to more robust usage. Recent works routinely rely on self-consistency to estimate aleatoric uncertainty (AU), yet this proxy collapses when models are overconfident and produce the same incorrect answer across samples. We analyze this regime and show that cross-model semantic disagreement is higher on incorrect answers precisely when AU is low. Motivated by this, we introduce an epistemic uncertainty (EU) term that operates in the black-box access setting: EU uses only generated text from a small, scale-matched ensemble and is computed as the gap between inter-model and intra-model sequence-semantic similarity. We then define total uncertainty (TU) as the sum of AU and EU. In a comprehensive study across five 7-9B instruction-tuned models and ten long-form tasks, TU improves ranking calibration and selective abstention relative to AU, and EU reliably flags confident failures where AU is low. We further characterize when EU is most useful via agreement and complementarity diagnostics.
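
The decomposition described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: AU is taken as one minus the mean pairwise similarity among a model's own samples, EU as the intra-model similarity minus the inter-model similarity (clipped at zero), and a token-overlap Jaccard score stands in for a real sequence-semantic similarity model. The function names (`jaccard`, `intra_similarity`, `inter_similarity`, `uncertainties`) are hypothetical and not taken from the paper.

```python
from itertools import combinations, product
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Toy stand-in for a semantic similarity score in [0, 1] (assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def intra_similarity(samples, sim=jaccard):
    """Mean pairwise similarity among one model's own samples."""
    pairs = list(combinations(samples, 2))
    return mean(sim(a, b) for a, b in pairs) if pairs else 1.0


def inter_similarity(samples, other_models, sim=jaccard):
    """Mean similarity between the target's samples and other models' samples."""
    pairs = [(a, b) for other in other_models for a, b in product(samples, other)]
    return mean(sim(a, b) for a, b in pairs) if pairs else 1.0


def uncertainties(target, others, sim=jaccard):
    """Return AU, EU, and TU = AU + EU under the assumed definitions."""
    intra = intra_similarity(target, sim)
    inter = inter_similarity(target, others, sim)
    au = 1.0 - intra              # self-consistency proxy (assumed form)
    eu = max(0.0, intra - inter)  # intra/inter similarity gap (assumed sign)
    return {"AU": au, "EU": eu, "TU": au + eu}


if __name__ == "__main__":
    # A self-consistent target model that the other model disagrees with.
    target = ["paris is the capital"] * 3
    others = [["lyon is the capital"] * 3]
    print(uncertainties(target, others))
```

In this toy case the target model repeats the same answer, so the self-consistency proxy reports no uncertainty, while the disagreement with the other model yields a positive EU. This is the confident-failure regime the abstract describes, where AU alone is uninformative.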
