2602.14080v1 Feb 15, 2026 cs.CL

Empty Shelves or Lost Keys? Recall Is the Bottleneck for Parametric Factuality

Nitay Calderon

Technion

Citations: 378

h-index: 10

Eyal Ben-David

Citations: 3,224

h-index: 5

Zorik Gekhman

Google Research, Technion - Israel Institute of Technology

Citations: 786

h-index: 8

E. Ofek

Citations: 34,334

h-index: 86

G. Yona

Citations: 4,381

h-index: 17

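Standard factuality evaluations of LLMs treat all errors alike and therefore cannot reveal whether an error stems from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that characterizes knowledge at the level of individual facts. Each fact is classified by whether it is encoded and by how accessible it is (cannot be recalled, can be recalled directly, or requires inference-time computation). To support this profiling, we introduce WikiProfile, a new benchmark built with an automated pipeline that uses an LLM grounded in web search. Analyzing 4 million responses generated by 13 LLMs, we find that encoding of the benchmark facts is nearly saturated in frontier models: GPT-5 and Gemini-3 encode 95–98% of the facts. Recall, however, remains the main bottleneck, and many errors previously attributed to missing knowledge instead stem from failures to access the fact. These failures are systematic and disproportionately affect less common facts and reverse questions. Finally, inference-time reasoning improves recall and can resolve a substantial share of the errors, suggesting that future progress may depend less on simply scaling model size and more on improving how models use the information they have already encoded.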

Original Abstract

Standard factuality evaluations of LLMs treat all errors alike, obscuring whether failures arise from missing knowledge (empty shelves) or from limited access to encoded facts (lost keys). We propose a behavioral framework that profiles factual knowledge at the level of facts rather than questions, characterizing each fact by whether it is encoded, and then by how accessible it is: cannot be recalled, can be directly recalled, or can only be recalled with inference-time computation (thinking). To support such profiling, we introduce WikiProfile, a new benchmark constructed via an automated pipeline with a prompted LLM grounded in web search. Across 4 million responses from 13 LLMs, we find that encoding is nearly saturated in frontier models on our benchmark, with GPT-5 and Gemini-3 encoding 95–98% of facts. However, recall remains a major bottleneck: many errors previously attributed to missing knowledge instead stem from failures to access it. These failures are systematic and disproportionately affect long-tail facts and reverse questions. Finally, we show that thinking improves recall and can recover a substantial fraction of failures, indicating that future gains may rely less on scaling and more on methods that improve how models utilize what they already encode.
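
The fact-level profiling described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how encoding or recall is measured, so the three boolean fields on `FactObservation`, the `Profile` names, and the `summarize` helper are all hypothetical. The sketch only shows the order of the decision the abstract describes, which is whether the fact is encoded and then how accessible it is.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Profile(Enum):
    """Hypothetical labels for the four fact-level outcomes."""
    NOT_ENCODED = "not encoded (empty shelf)"
    ENCODED_NOT_RECALLED = "encoded, cannot be recalled (lost key)"
    RECALLED_WITH_THINKING = "encoded, recalled only with thinking"
    RECALLED_DIRECTLY = "encoded, directly recalled"


@dataclass(frozen=True)
class FactObservation:
    # Assumed behavioral outcomes for one fact; how each flag is
    # obtained from model responses is outside this sketch.
    encoded: bool
    recalled_directly: bool
    recalled_with_thinking: bool


def profile_fact(obs: FactObservation) -> Profile:
    """Classify a fact: first by encoding, then by accessibility."""
    if not obs.encoded:
        return Profile.NOT_ENCODED
    if obs.recalled_directly:
        return Profile.RECALLED_DIRECTLY
    if obs.recalled_with_thinking:
        return Profile.RECALLED_WITH_THINKING
    return Profile.ENCODED_NOT_RECALLED


def summarize(observations: list[FactObservation]) -> dict[Profile, float]:
    """Return the share of facts that falls into each profile."""
    if not observations:
        return {}
    counts = Counter(profile_fact(o) for o in observations)
    return {p: counts[p] / len(observations) for p in Profile}


if __name__ == "__main__":
    facts = [
        FactObservation(True, True, True),
        FactObservation(True, False, True),
        FactObservation(True, False, False),
        FactObservation(False, False, False),
    ]
    for profile, share in summarize(facts).items():
        print(f"{profile.value}: {share:.2f}")
```

Under this reading, a standard accuracy metric would merge the first and second profiles into a single "wrong" bucket, which is the distinction the abstract argues is lost.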

4 Citations

0 Influential

30 Altmetric

154.0 Score
