2604.09852v1 Apr 10, 2026 cs.AI

MEMENTO: Teaching LLMs to Manage Their Own Context

John Langford (Citations: 683, h-index: 7)
Yuchen Zeng (Citations: 59, h-index: 3)
Ziyan Wang, KCL (Citations: 351, h-index: 9)
Vasilis Kontonis (Citations: 744, h-index: 17)
Shivam Garg (Citations: 99, h-index: 3)
Lingjiao Chen (Citations: 176, h-index: 4)
Ahmed Awadallah (Citations: 3,191, h-index: 11)
Eric Horvitz (Citations: 1, h-index: 1)
Dimitris Papailiopoulos (Citations: 165, h-index: 7)
Hao Tang (Citations: 305, h-index: 9)

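Existing reasoning models operate as long, unstructured streams and have no mechanism for compressing or organizing their intermediate state. This paper proposes MEMENTO, a method that trains a model to split its reasoning into blocks and to compress each block into a concise state summary called a memento. The model continues reasoning by attending only to the mementos, which reduces context, KV cache, and compute. To train MEMENTO models, the authors release OpenMementos, a public dataset built by segmenting 228K reasoning traces derived from OpenThoughts-v3 and annotating them with intermediate summaries. A two-stage SFT (supervised fine-tuning) recipe on OpenMementos proves effective across model families (Qwen3, Phi-4, Olmo 3) and scales (8B to 32B parameters). The trained models maintain strong accuracy on math, science, and coding benchmarks while reducing peak KV cache by about 2.5×. The authors extend vLLM to support MEMENTO inference, achieving about 1.75× higher throughput, and use this to run reinforcement learning (RL) that further improves accuracy. Finally, they identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states. Removing this channel drops accuracy on AIME24 by 15 pp.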

Original Abstract

Reasoning models think in long, unstructured streams with no mechanism for compressing or organizing their own intermediate state. We introduce MEMENTO: a method that teaches models to segment reasoning into blocks, compress each block into a memento, i.e., a dense state summary, and reason forward by attending only to mementos, reducing context, KV cache, and compute. To train MEMENTO models, we release OpenMementos, a public dataset of 228K reasoning traces derived from OpenThoughts-v3, segmented and annotated with intermediate summaries. We show that a two-stage SFT recipe on OpenMementos is effective across different model families (Qwen3, Phi-4, Olmo 3) and scales (8B to 32B parameters). Trained models maintain strong accuracy on math, science, and coding benchmarks while achieving ~2.5× peak KV cache reduction. We extend vLLM to support our inference method, achieving ~1.75× throughput improvement while also enabling us to perform RL and further improve accuracy. Finally, we identify a dual information stream: information from each reasoning block is carried both by the memento text and by the corresponding KV states, which retain implicit information from the original block. Removing this channel drops accuracy by 15 pp on AIME24.
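
The block-then-memento scheme described in the abstract can be illustrated with a toy token-accounting sketch. It is not the authors' implementation: the `Block` type, the `peak_context` function, and all token counts are hypothetical, and the sketch assumes that a block stays visible while its memento is being written and is evicted immediately afterwards. It only counts tokens held in context, so it says nothing about accuracy or about the implicit information that the memento KV states are reported to carry.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    reasoning_tokens: int  # length of the raw reasoning block (hypothetical)
    memento_tokens: int    # length of the summary that replaces it (hypothetical)


def peak_context(prompt_tokens: int, blocks: List[Block], compact: bool) -> int:
    """Return the peak number of tokens held in context while generating `blocks`.

    With compact=True, each block is dropped once its memento is written,
    so only the prompt and earlier mementos remain visible afterwards.
    With compact=False, nothing is ever dropped.
    """
    kept = prompt_tokens  # tokens that remain in context going forward
    peak = kept
    for block in blocks:
        # Assumption: the block is still visible while its memento is generated.
        live = kept + block.reasoning_tokens + block.memento_tokens
        peak = max(peak, live)
        if compact:
            kept += block.memento_tokens  # block evicted, memento stays
        else:
            kept = live                   # block and memento both stay
    return peak


# Toy trace: three blocks of 1000 reasoning tokens, each summarized in 100 tokens.
example = [Block(1000, 100)] * 3
print(peak_context(100, example, compact=False))  # 3400
print(peak_context(100, example, compact=True))   # 1400
```

In this toy trace the peak context falls from 3400 to 1400 tokens. The numbers are made up for illustration and are not the paper's measurements; the actual reduction depends on block lengths, memento lengths, and how the serving engine frees KV cache.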

1 Citation

0 Influential

8.5 Altmetric

43.5 Score
