2602.13272v1 Feb 05, 2026 cs.AI

TemporalBench: A Benchmark for Evaluating LLM-Based Agents on Contextual and Event-Informed Time Series Tasks

Wei Yang

Muyan Weng

Defu Cao

Yash Sharma

Yan Liu

Abstract

It is unclear whether strong forecasting performance reflects genuine temporal understanding or the ability to reason under contextual and event-driven conditions. We introduce TemporalBench, a multi-domain benchmark designed to evaluate temporal reasoning behavior under progressively richer informational settings. TemporalBench adopts a four-tier task taxonomy that examines historical structure interpretation, context-free forecasting, contextual temporal reasoning, and event-conditioned prediction across four real-world domains: retail, healthcare, energy, and physical systems. By controlling access to future targets and contextual information, the benchmark enables a diagnostic analysis of whether models can correctly interpret temporal patterns, align them with external context, and adapt predictions when conditions change. Extensive baseline experiments show that strong numerical forecasting accuracy does not reliably translate into robust contextual or event-aware temporal reasoning; instead, existing agent frameworks exhibit fragmented strengths and systematic failure modes that remain largely hidden under forecasting-only benchmarks. The TemporalBench dataset is publicly available at https://huggingface.co/datasets/Melady/TemporalBench, and we additionally provide a public leaderboard at https://huggingface.co/spaces/Melady/TemporalBench_Leaderboard.
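The idea of controlling access to future targets and contextual information across the four tiers can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Tier`, `Sample`, `VISIBLE`, and `build_view` names are hypothetical, and the exact mapping of which inputs each tier exposes is inferred from the tier names in the abstract, not taken from the TemporalBench dataset schema or code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Sequence


class Tier(Enum):
    """The four task tiers named in the abstract."""
    HISTORICAL_STRUCTURE = 1   # historical structure interpretation
    CONTEXT_FREE_FORECAST = 2  # context-free forecasting
    CONTEXTUAL_REASONING = 3   # contextual temporal reasoning
    EVENT_CONDITIONED = 4      # event-conditioned prediction


@dataclass(frozen=True)
class Sample:
    """Hypothetical sample layout; not the real dataset schema."""
    history: Sequence[float]        # observed past values
    future: Sequence[float]         # ground-truth targets, kept for scoring only
    context: Optional[str] = None   # external context description
    event: Optional[str] = None     # event the prediction is conditioned on


# Assumed visibility per tier: (show_context, show_event).
VISIBLE = {
    Tier.HISTORICAL_STRUCTURE: (False, False),
    Tier.CONTEXT_FREE_FORECAST: (False, False),
    Tier.CONTEXTUAL_REASONING: (True, False),
    Tier.EVENT_CONDITIONED: (True, True),
}


def build_view(sample: Sample, tier: Tier) -> dict:
    """Return only the fields a model may see at the given tier.

    Future targets are never included in the view.
    """
    show_context, show_event = VISIBLE[tier]
    view = {"history": list(sample.history)}
    if show_context and sample.context is not None:
        view["context"] = sample.context
    if show_event and sample.event is not None:
        view["event"] = sample.event
    return view


if __name__ == "__main__":
    # Made-up retail example, for illustration only.
    sample = Sample(
        history=[120.0, 118.0, 131.0],
        future=[140.0, 90.0],
        context="weekly sales for one store",
        event="store closed for two days next week",
    )
    for tier in Tier:
        print(tier.name, build_view(sample, tier))
```

Keeping the visibility rules in a single table makes it easy to compare a model's output across tiers on the same sample, which is the kind of diagnostic comparison the abstract describes.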
