2603.01042v1 Mar 01, 2026 cs.CL

Thoth: Mid-Training Bridges LLMs to Time Series Understanding

Jialong Wu

Citations: 771

h-index: 11

Zhongyi Pei

Citations: 1,253

h-index: 4

Jianmin Wang

Citations: 331

h-index: 9

Jia-Chun Lin

Citations: 4

h-index: 1

Yuxuan Wang

Citations: 788

h-index: 6

Huakun Luo

Citations: 1,124

h-index: 5

Abstract

Large Language Models (LLMs) have demonstrated remarkable success in general-purpose reasoning. However, they still struggle to understand and reason about time series data, which limits their effectiveness in decision-making scenarios that depend on temporal dynamics. In this paper, we propose Thoth, the first family of mid-trained LLMs with general-purpose time series understanding capabilities. As a pivotal intermediate stage, mid-training achieves task- and domain-agnostic alignment between time series and natural language, for which we construct Book-of-Thoth, a high-quality, time-series-centric mid-training corpus. Book-of-Thoth enables both time-series-to-text and text-to-time-series generation, equipping LLMs with a foundational grasp of temporal patterns. To better evaluate advanced reasoning capabilities, we further present KnoTS, a novel benchmark of knowledge-intensive time series understanding, designed for joint reasoning over temporal patterns and domain knowledge. Extensive experiments demonstrate that mid-training with Book-of-Thoth enables Thoth to significantly outperform its base model and advanced LLMs across a range of time series question answering benchmarks. Moreover, Thoth exhibits superior capabilities when fine-tuned under data scarcity, underscoring the effectiveness of mid-training for time series understanding. Code is available at: https://github.com/thuml/Thoth.
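The abstract mentions both time-series-to-text and text-to-time-series generation but does not describe the corpus format. As a rough illustration of what a bidirectional training pair could look like, here is a minimal sketch. Everything in it is an assumption made for illustration and is not taken from Book-of-Thoth or the Thoth codebase: the comma-separated serialization, the toy trend caption, and the hypothetical helper names `serialize_series`, `describe_trend`, and `make_pairs`.

```python
from typing import Dict, List


def serialize_series(values: List[float], decimals: int = 2) -> str:
    """Render a numeric series as comma-separated text (assumed format)."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)


def describe_trend(values: List[float]) -> str:
    """Toy caption that only compares the last value with the first."""
    if values[-1] > values[0]:
        return "The series rises overall."
    if values[-1] < values[0]:
        return "The series falls overall."
    return "The series ends where it started."


def make_pairs(values: List[float]) -> List[Dict[str, str]]:
    """Build one example per direction from the same series."""
    series_text = serialize_series(values)
    caption = describe_trend(values)
    return [
        # Time series in, natural-language description out.
        {"direction": "ts2text", "input": series_text, "target": caption},
        # Natural-language description in, time series out.
        {"direction": "text2ts", "input": caption, "target": series_text},
    ]


pairs = make_pairs([1.0, 1.5, 2.25])
for pair in pairs:
    print(pair)
```

A real corpus would presumably use much richer descriptions than a single trend sentence. The sketch only shows that one series can supply supervision in both directions.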
