2601.13717v1 Jan 20, 2026 cs.CL

Simulated Ignorance Fails: A Systematic Study of LLM Behaviors on Forecasting Problems Before Model Knowledge Cutoff

Zehan Li (Citations: 5, h-index: 2)

YuXuan Wang (Citations: 8, h-index: 2)

Ali El Lahib (Citations: 7, h-index: 2)

Ying-Jieh Xia (Citations: 3, h-index: 1)

Xinyue Pi (Citations: 98, h-index: 6)

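Evaluating the forecasting ability of LLMs faces a fundamental difficulty. Prospective evaluation offers methodological rigor, but its evaluation latency is prohibitively long. Retrospective forecasting (RF) on past events, by contrast, suffers from rapidly shrinking clean evaluation data as state-of-the-art models acquire increasingly recent knowledge cutoffs. Simulated Ignorance (SI), which prompts a model to suppress pre-cutoff knowledge, has been proposed as a potential solution. This study systematically evaluates how well SI approximates True Ignorance (TI). Across 477 competition-level questions and 9 models, SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when the reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models show worse SI fidelity despite superior reasoning quality. These results show that prompts cannot reliably "rewind" a model's knowledge. The authors conclude that RF on pre-cutoff events is methodologically flawed and recommend against using SI-based retrospective setups to evaluate forecasting ability.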

Original Abstract

Evaluating LLM forecasting capabilities is constrained by a fundamental tension: prospective evaluation offers methodological rigor but prohibitive latency, while retrospective forecasting (RF) -- evaluating on already-resolved events -- faces rapidly shrinking clean evaluation data as SOTA models possess increasingly recent knowledge cutoffs. Simulated Ignorance (SI), prompting models to suppress pre-cutoff knowledge, has emerged as a potential solution. We provide the first systematic test of whether SI can approximate True Ignorance (TI). Across 477 competition-level questions and 9 models, we find that SI fails systematically: (1) cutoff instructions leave a 52% performance gap between SI and TI; (2) chain-of-thought reasoning fails to suppress prior knowledge, even when reasoning traces contain no explicit post-cutoff references; (3) reasoning-optimized models exhibit worse SI fidelity despite superior reasoning trace quality. These findings demonstrate that prompts cannot reliably "rewind" model knowledge. We conclude that RF on pre-cutoff events is methodologically flawed; we recommend against using SI-based retrospective setups to benchmark forecasting capabilities.
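To make the SI versus TI comparison concrete, here is a minimal Python sketch of what such a setup might look like. It is an illustration under stated assumptions, not the authors' method: the abstract gives neither the prompt wording nor the metric behind the 52% gap, so the prompt text, the use of the Brier score, and the helper names `build_si_prompt`, `brier_score`, and `relative_gap` are all hypothetical.

```python
from datetime import date


def build_si_prompt(question: str, simulated_cutoff: date) -> str:
    """Wrap a forecasting question in a simulated-ignorance instruction.

    The wording is a hypothetical example, not the prompt used in the paper.
    """
    return (
        f"Assume today is {simulated_cutoff.isoformat()}. "
        "Answer as if you know nothing about events after this date.\n"
        f"Question: {question}\n"
        "Answer with a probability between 0 and 1."
    )


def brier_score(probs: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecasts and binary outcomes (lower is better).

    Assumption: the Brier score stands in for whatever metric the paper uses.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def relative_gap(si_score: float, ti_score: float) -> float:
    """Difference between the SI and TI scores as a fraction of the TI score."""
    if ti_score == 0:
        raise ValueError("ti_score must be non-zero")
    return (si_score - ti_score) / ti_score


if __name__ == "__main__":
    # Made-up numbers for illustration only; these are not results from the paper.
    outcomes = [1, 0, 1, 1]
    ti_probs = [0.6, 0.4, 0.5, 0.7]  # a model that truly lacks the knowledge
    si_probs = [0.9, 0.1, 0.8, 0.9]  # a model told to ignore what it knows
    gap = relative_gap(brier_score(si_probs, outcomes), brier_score(ti_probs, outcomes))
    print(build_si_prompt("Will event X occur by the end of the year?", date(2024, 1, 1)))
    print(f"Relative SI vs TI gap in Brier score: {gap:.1%}")
```

In this toy case the SI forecasts are much sharper than the TI forecasts, which is the kind of leakage the abstract describes: a model instructed to ignore its knowledge still scores as though it has it.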

3 Citations

0 Influential

3 Altmetric

18.0 Score
