2603.09309v1 Mar 10, 2026 cs.AI


Rescaling Confidence: What Scale Design Reveals About LLM Metacognition



Original Abstract

Verbalized confidence, in which LLMs report a numerical certainty score, is widely used to estimate uncertainty in black-box settings, yet the confidence scale itself (typically 0–100) is rarely examined. We show that this design choice is not neutral. Across six LLMs and three datasets, verbalized confidence is heavily discretized, with more than 78% of responses concentrating on just three round-number values. To investigate this phenomenon, we systematically manipulate confidence scales along three dimensions (granularity, boundary placement, and range regularity) and evaluate metacognitive sensitivity using meta-d'. We find that a 0–20 scale consistently improves metacognitive efficiency over the standard 0–100 format, while boundary compression degrades performance and round-number preferences persist even under irregular ranges. These results demonstrate that confidence scale design directly affects the quality of verbalized uncertainty and should be treated as a first-class experimental variable in LLM evaluation.
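
The discretization claim in the abstract can be made concrete with a small sketch. The snippet below computes the share of responses that fall on the k most frequent confidence values. The function name `top_k_share` and the sample data are hypothetical and chosen only for illustration; they are not taken from the paper, and the snippet does not compute meta-d'.

```python
from collections import Counter


def top_k_share(confidences, k=3):
    """Return the fraction of responses that land on the k most frequent values.

    `confidences` is a list of verbalized confidence scores on any scale.
    This is an illustrative measure, not the paper's own analysis code.
    """
    if not confidences:
        raise ValueError("confidences must be non-empty")
    counts = Counter(confidences)
    # most_common(k) returns the k (value, count) pairs with the highest counts.
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(confidences)


# Made-up confidences on a 0-100 scale (an assumption, not data from the paper).
sample = [95, 90, 95, 100, 90, 95, 85, 100, 95, 90]
share = top_k_share(sample, k=3)
print(f"Top-3 values cover {share:.0%} of responses")
```

Running the same function on responses collected under different scales (for example 0–100 versus 0–20) would give one rough way to compare how strongly each scale concentrates responses, under the assumption that the most frequent values are the relevant ones to track.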

