2603.19017v1 Mar 19, 2026 cs.CL

What Really Controls Temporal Reasoning in Large Language Models: Tokenisation or Representation of Time?

Gagan Bhatia (Citations: 203, h-index: 7)

Ahmad Muhammad Isa (Citations: 7, h-index: 1)

Maxime Peyrard (Citations: 17, h-index: 3)

Wei Zhao (Citations: 65, h-index: 2)

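This work presents MultiTempBench, a multilingual temporal reasoning benchmark. It covers three tasks (date arithmetic, time zone conversion, and temporal relation extraction), five languages (English, German, Chinese, Arabic, and Hausa), and multiple calendar systems (Gregorian, Hijri, and Chinese Lunar). MultiTempBench is built by translating 750 curated English questions and expanding each into controlled date-format variants, for a total of 15,000 examples. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated against human severity ratings, along with a geometric-probing analysis of internal temporal representations. The analysis finds that the tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rare calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy drops, whereas high-resource settings are often robust to digit-level splitting. Beyond tokenisation, a mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, while fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb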

Original Abstract

We present MultiTempBench, a multilingual temporal reasoning benchmark spanning three tasks, date arithmetic, time zone conversion, and temporal relation extraction across five languages (English, German, Chinese, Arabic, and Hausa) and multiple calendar conventions (Gregorian, Hijri, and Chinese Lunar). MultiTempBench contains $15,000$ examples built by translating $750$ curated English questions and expanding each into controlled date-format variants. We evaluate 20 LLMs and introduce the multilingual Date Fragmentation Ratio (mDFR), calibrated with human severity ratings, together with geometric-probing analyses of internal temporal representations. We find tokenisation quality of temporal artefacts is a resource-dependent bottleneck: in low-resource languages and rarer calendar formats, fragmentation disrupts Year/Month/Day separation and accuracy collapses, while high-resource settings are often robust to digit-level splitting. Beyond tokenisation, crossed mixed-effects regression shows that temporal linearity is the strongest predictor of temporal reasoning in high-resource languages, whereas fragmentation is the stronger predictor in low-resource languages. Code is available at: https://github.com/gagan3012/mtb
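
The abstract describes two mechanical ideas: expanding a question into controlled date-format variants, and measuring how badly a tokeniser fragments a date. The Python sketch below illustrates both under stated assumptions. The format set `FORMATS` and the function `toy_fragmentation` are hypothetical names chosen for illustration. The abstract does not list the benchmark's actual variants and does not define mDFR, so this is not the authors' metric, only a simplified stand-in that counts date components which fail to survive as a single token.

```python
from datetime import date

# Hypothetical format set: the benchmark's real variants are not listed
# in the abstract. Only numeric directives are used, so the output does
# not depend on the system locale.
FORMATS = {
    "iso": "%Y-%m-%d",
    "slash_dmy": "%d/%m/%Y",
    "slash_mdy": "%m/%d/%Y",
    "compact": "%Y%m%d",
}


def expand_date_variants(d):
    """Render one date in every format of the (assumed) variant set."""
    return {name: d.strftime(fmt) for name, fmt in FORMATS.items()}


def toy_fragmentation(components, tokens):
    """Fraction of date components that do not appear as one whole token.

    This is an illustrative stand-in, not the paper's mDFR.
    `components` holds the year/month/day strings; `tokens` is the
    tokeniser output for the full date string.
    """
    if not components:
        return 0.0
    token_set = set(tokens)
    broken = sum(1 for c in components if c not in token_set)
    return broken / len(components)


variants = expand_date_variants(date(2026, 3, 19))

# A tokenisation that keeps year, month, and day intact...
clean_score = toy_fragmentation(
    ["2026", "03", "19"], ["2026", "-", "03", "-", "19"]
)
# ...versus one that splits the year and the month into pieces.
split_score = toy_fragmentation(
    ["2026", "03", "19"], ["20", "26", "-", "0", "3", "-", "19"]
)
```

In this toy setup the intact tokenisation scores 0 and the split one scores two thirds, since only the day survives whole. A real analysis would take the token lists from each model's own tokeniser rather than from hand-written examples.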
