2604.17283v1 Apr 19, 2026 cs.CL

HorizonBench: Long-Horizon Personalization with Evolving Preferences

Bhargavi Paranjape (Citations: 17,228; h-index: 14)

Asli Celikyilmaz (Citations: 528; h-index: 11)

L. Guan (Citations: 1,201; h-index: 11)

S. Li (Citations: 809; h-index: 13)

Lin Chen (Citations: 2,564; h-index: 9)

Natalia Zhang (Citations: 19; h-index: 2)

Zhongyao Ma (Citations: 10; h-index: 2)

Ge Zhou (Citations: 63; h-index: 4)

Sem Park (Citations: 121; h-index: 5)

Yulia Tsvetkov (Citations: 21; h-index: 3)

Kerem Oktar (Citations: 331; h-index: 8)

Diyi Yang (Citations: 463; h-index: 4)

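User preferences change over months of interaction, and tracking those changes requires inferring how a preference the user stated has since been altered by later events. We define this as the long-horizon personalization problem and find that progress on it is limited by data availability and measurement: no existing resource provides both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, providing ground-truth provenance for every preference change over a six-month timeline. From it we build HorizonBench, a benchmark of 4,245 items drawn from the six-month conversation histories of 360 simulated users; the histories average about 4,300 turns and about 163,000 tokens. HorizonBench serves as a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. In an evaluation of 25 frontier models, the best model reached 52.8% accuracy, while most models scored at or below the 20% chance baseline. When these models err on evolved preferences, more than a third of the time they select the value the user originally stated and fail to track the updated user state. This state-tracking failure persists regardless of context length and of how explicitly the preference is expressed, making it the main obstacle to long-horizon personalization.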

Original Abstract

User preferences evolve across months of interaction, and tracking them requires inferring when a stated preference has been changed by a subsequent life event. We define this problem as long-horizon personalization and observe that progress on it is limited by data availability and measurement, with no existing resource providing both naturalistic long-horizon interactions and the ground-truth provenance needed to diagnose why models fail. We introduce a data generator that produces conversations from a structured mental state graph, yielding ground-truth provenance for every preference change across 6-month timelines, and from it construct HorizonBench, a benchmark of 4,245 items from 360 simulated users with 6-month conversation histories averaging ~4,300 turns and ~163K tokens. HorizonBench provides a testbed for long-context modeling, memory-augmented architectures, theory-of-mind reasoning, and user modeling. Across 25 frontier models, the best model reaches 52.8% and most score at or below the 20% chance baseline. When these models err on evolved preferences, over a third of the time they select the user's originally stated value without tracking the updated user state. This belief-update failure persists across context lengths and expression explicitness levels, identifying state-tracking capability as the primary bottleneck for long-horizon personalization.
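The two headline measurements in the abstract (accuracy against a chance baseline, and the share of errors on evolved preferences that fall back to the originally stated value) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Item` schema, its field names, and the `score` function are hypothetical and are not taken from HorizonBench, and the five-option format is only inferred from the stated 20% chance baseline.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Item:
    """Hypothetical multiple-choice item; not the benchmark's real schema."""
    correct: str             # the user's current (possibly evolved) preference
    original: Optional[str]  # originally stated value if the preference evolved, else None
    predicted: str           # the option the model selected


def score(items: List[Item], num_options: int = 5) -> Dict[str, float]:
    """Return accuracy, the chance baseline, and the stale-value error rate.

    The stale-value error rate is the fraction of errors on evolved items
    where the model picked the user's originally stated value.
    num_options=5 is an assumption inferred from the 20% chance baseline.
    """
    if not items:
        raise ValueError("items must not be empty")

    accuracy = sum(i.predicted == i.correct for i in items) / len(items)

    # Errors restricted to items whose preference changed over time.
    evolved_errors = [
        i for i in items
        if i.original is not None and i.predicted != i.correct
    ]
    stale = sum(i.predicted == i.original for i in evolved_errors)
    stale_rate = stale / len(evolved_errors) if evolved_errors else 0.0

    return {
        "accuracy": accuracy,
        "chance": 1.0 / num_options,
        "stale_error_rate": stale_rate,
    }


if __name__ == "__main__":
    toy_items = [
        Item(correct="tea", original="coffee", predicted="coffee"),   # stale error
        Item(correct="tea", original="coffee", predicted="tea"),      # correct
        Item(correct="jazz", original=None, predicted="jazz"),        # correct, static
        Item(correct="running", original="swimming", predicted="yoga"),  # other error
        Item(correct="email", original="phone", predicted="phone"),   # stale error
    ]
    print(score(toy_items))
```

On this toy data, two of five predictions are correct, and two of the three errors on evolved items repeat the original value. Separating the stale-value rate from plain accuracy is what distinguishes a belief-update failure from generic wrong answers.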
