2604.10291v1 Apr 11, 2026 cs.AI

TimeSeriesExamAgent: Creating Time Series Reasoning Benchmarks at Scale

Yifu Cai

Citations: 466

h-index: 4

Malgorzata Gwiazda

Citations: 0

h-index: 0

Mononito Goswami

Carnegie Mellon University

Citations: 802

h-index: 12

Artur Dubrawski

Citations: 488

h-index: 5

Abstract

Large Language Models (LLMs) have shown promising performance in time series modeling tasks, but do they truly understand time series data? While multiple benchmarks have been proposed to answer this fundamental question, most are manually curated and focus on narrow domains or specific skill sets. To address this limitation, we propose scalable methods for creating comprehensive time series reasoning benchmarks that combine the flexibility of templates with the creativity of LLM agents. We first develop TimeSeriesExam, a multiple-choice benchmark using synthetic time series to evaluate LLMs across five core reasoning categories: pattern recognition, noise understanding, similarity analysis, anomaly detection, and causality. Then, with TimeSeriesExamAgent, we scale our approach by automatically generating benchmarks from real-world datasets spanning healthcare, finance, and weather domains. Through multi-dimensional quality evaluation, we demonstrate that our automatically generated benchmarks achieve diversity comparable to manually curated alternatives. However, our experiments reveal that LLM performance remains limited in both abstract time series reasoning and domain-specific applications, highlighting ongoing challenges in enabling effective time series understanding in these models. TimeSeriesExamAgent is available at https://github.com/magwiazda/TimeSeriesExamAgent.
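
To make the idea of a template-driven multiple-choice question over synthetic series concrete, here is a minimal sketch. It is not the authors' implementation: the `MultipleChoiceQuestion` class, the `make_series` and `trend_question` helpers, the slope values, and the question wording are all hypothetical, chosen only to illustrate one "pattern recognition" style item.

```python
import random
from dataclasses import dataclass


@dataclass
class MultipleChoiceQuestion:
    """Hypothetical container for one benchmark item."""
    category: str
    question: str
    series: list[float]
    options: list[str]
    answer_index: int


def make_series(n: int, slope: float, noise_std: float,
                rng: random.Random) -> list[float]:
    """Synthetic series: a linear trend plus Gaussian noise."""
    return [slope * t + rng.gauss(0.0, noise_std) for t in range(n)]


def trend_question(rng: random.Random, n: int = 100) -> MultipleChoiceQuestion:
    """Fill an assumed 'pattern recognition' template with a sampled trend."""
    # Each answer option is tied to the slope that makes it true.
    slopes = {
        "It trends upward.": 0.5,
        "It trends downward.": -0.5,
        "It has no clear trend.": 0.0,
    }
    truth = rng.choice(list(slopes))
    series = make_series(n, slopes[truth], noise_std=1.0, rng=rng)

    # Shuffle options so the correct answer position is not fixed.
    options = list(slopes)
    rng.shuffle(options)

    return MultipleChoiceQuestion(
        category="pattern recognition",
        question="Which statement best describes the series?",
        series=series,
        options=options,
        answer_index=options.index(truth),
    )


if __name__ == "__main__":
    q = trend_question(random.Random(0))
    print(q.question)
    for i, option in enumerate(q.options):
        print(f"  ({i}) {option}")
    print("correct option index:", q.answer_index)
```

A model under evaluation would receive `question`, `series`, and `options`, and its chosen index would be compared with `answer_index`. Because the ground truth comes from the generating parameters, no manual labeling is needed in this toy setup.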

1 Citation

0 Influential

34.95879734614 Altmetric

175.8 Score
