2604.02118v1 Apr 02, 2026 cs.AI

An LLM-Based Evaluation Model for Explanations Grounded in Time Series Data

LLM-as-a-Judge for Time Series Explanations

Murari Mandal

Citations: 79

h-index: 5

Dhruv Kumar

Citations: 26

h-index: 2

Preetham Sivalingam

Citations: 0

h-index: 0

Saurabh Deshpande

Citations: 4

h-index: 1

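Evaluating the factual accuracy of LLM-generated natural language explanations grounded in time series data remains an open problem. Recent models produce textual interpretations of numerical signals, but existing evaluation methods are limited. Reference-based similarity metrics and consistency-checking models require ground-truth explanations, while conventional time series analysis methods operate purely on numerical values and cannot assess free-form textual reasoning. As a result, there is no general method for directly verifying how faithful an explanation is to the time series data without predefined references or task-specific rules. In this study, we use large language models to both generate and evaluate time series explanations in a reference-free setting. Given a time series, a question, and a candidate explanation, the evaluator assigns one of three labels (correct, partially correct, or incorrect) based on pattern identification, numeric accuracy, and answer faithfulness, which enables systematic scoring and comparison. To support this, we built a synthetic benchmark of 350 time series cases across seven question types, with each case paired with a correct, a partially correct, and an incorrect explanation. We evaluated model performance on four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. The results show a clear asymmetry. Explanation generation is highly pattern dependent and exhibits systematic errors on certain question types: accuracy ranges from 0.00 to 0.12 for Seasonal Drop and Volatility Shift, but from 0.94 to 0.96 for Structural Break. Evaluation, by contrast, is comparatively stable, with models correctly ranking and scoring explanations even when their own generated outputs are incorrect. These results demonstrate the feasibility of data-grounded LLM evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.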

Original Abstract

Evaluating the factual correctness of LLM-generated natural language explanations grounded in time series data remains an open challenge. Although modern models generate textual interpretations of numerical signals, existing evaluation methods are limited: reference-based similarity metrics and consistency-checking models require ground-truth explanations, while traditional time series methods operate purely on numerical values and cannot assess free-form textual reasoning. Thus, no general-purpose method exists to directly verify whether an explanation is faithful to the underlying time series data without predefined references or task-specific rules. We study large language models as both generators and evaluators of time series explanations in a reference-free setting, where, given a time series, a question, and a candidate explanation, the evaluator assigns a ternary correctness label based on pattern identification, numeric accuracy, and answer faithfulness, enabling principled scoring and comparison. To support this, we construct a synthetic benchmark of 350 time series cases across seven query types, each paired with correct, partially correct, and incorrect explanations. We evaluate models across four tasks: explanation generation, relative ranking, independent scoring, and multi-anomaly detection. Results show a clear asymmetry: generation is highly pattern dependent and exhibits systematic failures on certain query types, with accuracies ranging from 0.00 to 0.12 for Seasonal Drop and Volatility Shift up to 0.94 to 0.96 for Structural Break, while evaluation is more stable, with models correctly ranking and scoring explanations even when their own outputs are incorrect. These findings demonstrate the feasibility of data-grounded, LLM-based evaluation for time series explanations and highlight the potential of LLMs as reliable evaluators of data-grounded reasoning in the time series domain.
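
The reference-free judging setup described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the names `Label`, `JudgeInput`, `build_prompt`, `parse_label`, and `ranking_is_consistent`, the prompt wording, and the numeric values attached to the three labels are all hypothetical. The sketch only builds a prompt and interprets a reply; the call to an actual LLM is left out.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Sequence


class Label(Enum):
    # The three correctness labels come from the abstract; the numeric
    # values are an assumption added here so labels can be ordered.
    INCORRECT = 0
    PARTIALLY_CORRECT = 1
    CORRECT = 2


@dataclass(frozen=True)
class JudgeInput:
    """One judging instance: the data, the question, and a candidate."""
    series: Sequence[float]
    question: str
    explanation: str


def build_prompt(item: JudgeInput) -> str:
    """Assemble a judging prompt with no reference explanation (hypothetical wording)."""
    values = ", ".join(f"{v:g}" for v in item.series)
    return (
        "Time series: " + values + "\n"
        "Question: " + item.question + "\n"
        "Candidate explanation: " + item.explanation + "\n"
        "Judge pattern identification, numeric accuracy, and answer "
        "faithfulness. Reply with exactly one of: "
        "CORRECT, PARTIALLY_CORRECT, INCORRECT."
    )


def parse_label(reply: str) -> Label:
    """Map a judge reply to a Label; raise ValueError for anything else."""
    key = reply.strip().upper().replace(" ", "_")
    try:
        return Label[key]
    except KeyError:
        raise ValueError(f"unrecognized judge reply: {reply!r}") from None


def ranking_is_consistent(correct: Label, partial: Label, incorrect: Label) -> bool:
    """True when the labels given to one case's three explanations are strictly ordered."""
    return correct.value > partial.value > incorrect.value
```

In this sketch, a benchmark case would yield three `JudgeInput` objects (one per explanation), and `ranking_is_consistent` would check whether the judge's labels preserve the intended order. How the paper actually scores the ranking and scoring tasks is not specified in the abstract.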
