2601.23204v1 Jan 30, 2026 cs.AI

TSAQA: Time Series Analysis Question And Answering Benchmark

Zhining Liu

Citations: 388

h-index: 12

Jiaru Zou

University of Illinois Urbana-Champaign

Citations: 390

h-index: 11

Ruizhong Qiu

Citations: 853

h-index: 16

Xiao Lin

University of Illinois Urbana-Champaign

Citations: 294

h-index: 10

Dongqi Fu

Citations: 793

h-index: 18

Tianxin Wei

Citations: 401

h-index: 13

Hanghang Tong

Citations: 44

h-index: 4

Zhichen Zeng

University of Illinois Urbana-Champaign

Citations: 669

h-index: 16

Yuchen Yan

Citations: 287

h-index: 9

Jingrui He

Citations: 213

h-index: 9

Baoyu Jing

Citations: 44

h-index: 3

Sanhorn Chen

Citations: 5

h-index: 1

Lecheng Zheng

Citations: 650

h-index: 14

Boyu Liu

Citations: 39

h-index: 4

Zihao Li

University of Illinois Urbana-Champaign

Citations: 316

h-index: 11

Jingchao Ni

Citations: 3,211

h-index: 26

Abstract

Time series data are integral to critical applications across domains such as finance, healthcare, transportation, and environmental science. While recent work has begun to explore multi-task time series question answering (QA), current benchmarks remain limited to forecasting and anomaly detection tasks. We introduce TSAQA, a novel unified benchmark designed to broaden task coverage and evaluate diverse temporal analysis capabilities. TSAQA integrates six diverse tasks under a single framework, ranging from conventional analysis, including anomaly detection and classification, to advanced analysis, such as characterization, comparison, data transformation, and temporal relationship analysis. Spanning 210k samples across 13 domains, the dataset employs diverse formats, including true-or-false (TF), multiple-choice (MC), and a novel puzzling (PZ) format, to comprehensively assess time series analysis. Zero-shot evaluation demonstrates that these tasks are challenging for current Large Language Models (LLMs): the best-performing commercial LLM, Gemini-2.5-Flash, achieves an average score of only 65.08. Although instruction tuning boosts open-source performance, the best-performing open-source model, LLaMA-3.1-8B, still shows significant room for improvement, highlighting the complexity of temporal analysis for LLMs.

4 Citations

0 Influential

13 Altmetric

69.0 Score

