2601.07606v1 Jan 12, 2026 cs.CL

Proof of Time: A Benchmark for Measuring How Well Models Judge Scientific Ideas

Proof of Time: A Benchmark for Evaluating Scientific Idea Judgments

Zidi Xiong

Citations: 79

h-index: 4

Shan Chen

Citations: 37

h-index: 3

Bingyang Ye

Citations: 2

h-index: 1

Jingxuan Tu

Citations: 10

h-index: 2

Chen Liu

Yale University

Citations: 221

h-index: 10

S. Schmidgall

Citations: 436

h-index: 7

D. Bitterman

Citations: 891

h-index: 13

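Large language models are increasingly used to evaluate and forecast research ideas, but there is still no scalable way to assess the quality of the scientific idea judgments these models make. To that end, we introduce PoT (Proof of Time), a semi-verifiable benchmarking framework. PoT links scientific idea judgments to outcome signals that can be observed later (e.g., citation counts and shifts in researchers' agendas). PoT freezes the evidence available before a cutoff date in an offline sandbox and asks models to forecast outcomes after that date, so judgments can be verified once the actual outcomes arrive. PoT also supports scalable benchmarking without exhaustive expert annotation and allows human-model disagreement to be analyzed against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research idea judgment, comparing tool-using agents with non-agent baselines under prompt ablations and budget scaling. In experiments on more than 30,000 instances spanning four benchmark domains, tool-using agents generally improve relative to non-agent baselines as their interaction budget increases, while the benefit of tool use varies strongly by task. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT enables scalable evaluation of agents on forward-looking scientific idea judgment tasks.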

Original Abstract

Large language models are increasingly being used to assess and forecast research ideas, yet we lack scalable ways to evaluate the quality of models' judgments about these scientific ideas. Towards this goal, we introduce PoT, a semi-verifiable benchmarking framework that links scientific idea judgments to downstream signals that become observable later (e.g., citations and shifts in researchers' agendas). PoT freezes a pre-cutoff snapshot of evidence in an offline sandbox and asks models to forecast post-cutoff outcomes, enabling verifiable evaluation when ground truth arrives, scalable benchmarking without exhaustive expert annotation, and analysis of human-model misalignment against signals such as peer-review awards. In addition, PoT provides a controlled testbed for agent-based research judgments that evaluate scientific ideas, comparing tool-using agents to non-agent baselines under prompt ablations and budget scaling. Across 30,000+ instances spanning four benchmark domains, we find that, compared with non-agent baselines, higher interaction budgets generally improve agent performance, while the benefit of tool use is strongly task-dependent. By combining time-partitioned, future-verifiable targets with an offline sandbox for tool use, PoT supports scalable evaluation of agents on future-facing scientific idea judgment tasks.
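The time-partitioned setup described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Paper` record, the `freeze_snapshot` and `score_forecast` helpers, the sample data, and the "most-cited paper" task are all hypothetical, chosen only to show how evidence before a cutoff can be separated from outcomes that become observable afterwards.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Paper:
    # Hypothetical record; the field names are assumptions, not PoT's schema.
    paper_id: str
    published: date
    later_citations: int  # outcome observed only after the cutoff


def freeze_snapshot(papers, cutoff):
    """Keep pre-cutoff papers and hide their outcomes from the model."""
    return [
        {"paper_id": p.paper_id, "published": p.published}
        for p in papers
        if p.published < cutoff
    ]


def score_forecast(predicted_id, papers, cutoff):
    """Return 1.0 if the forecast names the pre-cutoff paper that later
    gathered the most citations, else 0.0."""
    eligible = [p for p in papers if p.published < cutoff]
    best = max(eligible, key=lambda p: p.later_citations)
    return 1.0 if predicted_id == best.paper_id else 0.0


# Illustrative data only; these are not results from the paper.
papers = [
    Paper("a", date(2024, 3, 1), 12),
    Paper("b", date(2024, 6, 1), 40),
    Paper("c", date(2025, 2, 1), 90),  # published after the cutoff
]
cutoff = date(2025, 1, 1)
snapshot = freeze_snapshot(papers, cutoff)
```

A model or agent would see only `snapshot` when making its forecast. Scoring is deferred until the later citation counts are known, which is the sense in which such a target is verifiable after the fact.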

1 Citations

0 Influential

6.5 Altmetric

33.5 Score
