2604.15859v1 Apr 17, 2026 cs.LG

QuantSightBench: Evaluating LLM Quantitative Forecasting with Prediction Intervals

Original Abstract

Forecasting has become a natural benchmark for reasoning under uncertainty. Yet existing evaluations of large language models remain limited to judgmental tasks in simple formats, such as binary or multiple-choice questions. In practice, however, forecasting spans a far broader scope. Across domains such as economics, public health, and social demographics, decisions hinge on numerical estimates over continuous quantities, a capability that current benchmarks do not capture. Evaluating such estimates requires a format that makes uncertainty explicit and testable. We propose prediction intervals as a natural and rigorous interface for this purpose. They demand scale awareness, internal consistency across confidence levels, and calibration over a continuum of outcomes, making them a more suitable evaluation format than point estimates for numerical forecasting. To assess this capability, we introduce a new benchmark, QuantSightBench, and evaluate frontier models under multiple settings, assessing both empirical coverage and interval sharpness. Our results show that none of the 11 evaluated frontier and open-weight models achieves the 90% coverage target, with the top performers Gemini 3.1 Pro (79.1%), Grok 4 (76.4%), and GPT-5.4 (75.3%) all falling at least 10 percentage points short. Calibration degrades sharply at extreme magnitudes, revealing systematic overconfidence across all evaluated models.
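The two quantities named in the abstract, empirical coverage and interval sharpness, can be illustrated with a minimal Python sketch. This is not the benchmark's scoring code: the abstract does not say how sharpness is measured, so mean interval width is used here as a stand-in, and the Winkler interval score is added as one standard way to combine width with a penalty for misses. The `IntervalForecast` type, the function names, and the toy numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class IntervalForecast:
    """One prediction interval plus the resolved true value (hypothetical type)."""
    lower: float
    upper: float
    truth: float


def empirical_coverage(forecasts):
    """Fraction of intervals that contain the true value."""
    hits = sum(1 for f in forecasts if f.lower <= f.truth <= f.upper)
    return hits / len(forecasts)


def mean_width(forecasts):
    """Average interval width; an assumed stand-in for sharpness."""
    return sum(f.upper - f.lower for f in forecasts) / len(forecasts)


def winkler_score(f, alpha=0.10):
    """Winkler interval score for a (1 - alpha) interval; lower is better.

    The score is the interval width plus a penalty of (2 / alpha) times
    the distance by which the truth falls outside the interval.
    """
    width = f.upper - f.lower
    if f.truth < f.lower:
        return width + (2 / alpha) * (f.lower - f.truth)
    if f.truth > f.upper:
        return width + (2 / alpha) * (f.truth - f.upper)
    return width


# Toy data, invented for illustration only (not from the benchmark).
forecasts = [
    IntervalForecast(10.0, 20.0, 15.0),     # truth inside the interval
    IntervalForecast(100.0, 200.0, 250.0),  # truth above the upper bound
    IntervalForecast(0.5, 1.5, 1.0),        # truth inside the interval
    IntervalForecast(3.0, 5.0, 4.0),        # truth inside the interval
]

print("coverage:", empirical_coverage(forecasts))
print("mean width:", mean_width(forecasts))
print("Winkler score of the miss:", winkler_score(forecasts[1]))
```

On this toy set three of the four intervals contain the truth, so coverage is 75% against a nominal 90% target. That is the pattern the abstract calls overconfidence: the stated intervals are narrower than the forecaster's actual accuracy supports.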
