2602.01313v2 Feb 01, 2026 cs.CL

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models

Chuanrui Hu (Citations: 32, h-index: 3)
Xingze Gao (Citations: 27, h-index: 2)
Dannong Xu (Citations: 27, h-index: 2)
Yi Bai (Citations: 29, h-index: 2)
Tong Li (Citations: 65, h-index: 4)
Yafeng Deng (Citations: 35, h-index: 3)
Jian Pei (Citations: 328, h-index: 5)
Hongda Chen (Citations: 35, h-index: 3)
Tianwei Lin (Citations: 5, h-index: 2)
Xiaohong Li (Citations: 7, h-index: 2)
Yunyun Han (Citations: 11, h-index: 2)

Original Abstract

Long-term conversational memory is essential for LLM-based assistants, yet existing benchmarks focus on dyadic, single-topic dialogues that fail to capture real-world complexity. We introduce EverMemBench, a benchmark featuring multi-party, multi-group conversations spanning over 1 million tokens with temporally evolving information, cross-topic interleaving, and role-specific personas. EverMemBench evaluates memory systems across three dimensions through 1,000+ QA pairs: fine-grained recall, memory awareness, and user profile understanding. Our evaluation reveals critical limitations: (1) multi-hop reasoning collapses in multi-party settings, with even oracle models achieving only 26%; (2) temporal reasoning remains unsolved, requiring version semantics beyond timestamp matching; (3) memory awareness is bottlenecked by retrieval, where current similarity-based methods fail to bridge the semantic gap between queries and implicitly relevant memories. EverMemBench provides a challenging testbed for developing next-generation memory architectures.
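The retrieval bottleneck described in point (3) can be illustrated with a toy sketch. This is not the benchmark's method or any evaluated system: it assumes a bag-of-words cosine similarity as a simple stand-in for similarity-based retrieval, and the memories, the query, and the function names (`bag_of_words`, `cosine`, `retrieve`) are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, memories: list[str], k: int = 1) -> list[str]:
    """Return the k memories most similar to the query (hypothetical helper)."""
    q = bag_of_words(query)
    ranked = sorted(memories, key=lambda m: cosine(q, bag_of_words(m)), reverse=True)
    return ranked[:k]


# Hypothetical memories: the first one matters for the query only implicitly
# (Alice is on the team), and it shares no words with the query.
memories = [
    "Alice said she is allergic to peanuts.",
    "The team ordered lunch for the offsite last Friday.",
]
query = "What lunch should we order for the team today?"

top = retrieve(query, memories)
allergy_score = cosine(bag_of_words(query), bag_of_words(memories[0]))
print(top, allergy_score)
```

In this toy setup the lexically similar lunch memory is ranked first, while the allergy memory scores zero even though it is the one an assistant would need. Embedding-based retrievers may narrow this gap in some cases; the sketch only shows the kind of mismatch between surface similarity and implicit relevance that the abstract refers to.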

2 Citations

0 Influential

2.5 Altmetric

14.5 Score
