2606.09659v1 Jun 08, 2026 cs.CL

End-to-End Context Compression at Scale

Pavel Izmailov
Harshitha Menon
Micah Goldblum
Brian R. Bartoldson
B. Kailkhura
Zhuang Liu
Sean McLeish
Angela W. Li
Hao Chen
Nimit Kalra
Zaiqian Chen
Artem Gazizov
Venkata Anoop Suhas Kumar Morisetty
Tom Goldstein
Sanae Lotfi

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
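To make the memory argument concrete, here is a rough back-of-the-envelope sketch of how KV cache size scales with context length, and how many positions the decoder would have to cache at the 1:4, 1:8, and 1:16 compression ratios mentioned above. The layer count, KV head count, head dimension, and 2-byte precision are hypothetical values chosen for illustration and are not taken from the paper. The sketch also assumes the decoder caches one KV entry per latent embedding and ignores the encoder's own cost.

```python
import math


def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` positions.

    Each layer stores two tensors (K and V), each of shape
    num_tokens x num_kv_heads x head_dim.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


def compressed_length(num_tokens, ratio):
    """Number of latent embeddings at a 1:`ratio` compression ratio (rounded up)."""
    return math.ceil(num_tokens / ratio)


# Hypothetical decoder shape, for illustration only (not from the paper).
CONFIG = dict(num_layers=36, num_kv_heads=8, head_dim=128)


def summarize(num_tokens, ratios=(4, 8, 16)):
    """Map each ratio to (cached positions, KV cache bytes); key 1 is uncompressed."""
    rows = {1: (num_tokens, kv_cache_bytes(num_tokens, **CONFIG))}
    for r in ratios:
        n = compressed_length(num_tokens, r)
        rows[r] = (n, kv_cache_bytes(n, **CONFIG))
    return rows


if __name__ == "__main__":
    for ratio, (positions, size) in summarize(131072).items():
        print(f"1:{ratio:<2d} positions={positions:>6d} kv_cache={size / 2**30:.2f} GiB")
```

Under these assumptions the cache grows linearly with the number of cached positions, so a 1:16 ratio cuts the decoder-side KV cache by a factor of 16 for the compressed portion of the context.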
