2606.09659v1 Jun 08, 2026 cs.CL

End-to-End Context Compression at Scale

Pavel Izmailov

Citations: 17

h-index: 2

Harshitha Menon

Citations: 109

h-index: 2

Micah Goldblum

Citations: 1,658

h-index: 20

Brian R. Bartoldson

Citations: 1,162

h-index: 14

B. Kailkhura

Citations: 241

h-index: 5

Zhuang Liu

Citations: 4

h-index: 1

Sean McLeish

Citations: 403

h-index: 6

Angela W. Li

Citations: 3

h-index: 1

Hao Chen

Citations: 1

h-index: 1

Nimit Kalra

Citations: 189

h-index: 4

Zaiqian Chen

Citations: 22

h-index: 2

Artem Gazizov

Citations: 15

h-index: 2

Venkata Anoop Suhas Kumar Morisetty

Citations: 0

h-index: 0

Tom Goldstein

Citations: 552

h-index: 4

Sanae Lotfi

Citations: 460

h-index: 8

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.
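The memory argument in the abstract can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: the decoder shape (`LAYERS`, `KV_HEADS`, `HEAD_DIM`), the 16-bit cache precision, and the 128,000-token context are hypothetical values. It also assumes that a ratio of 1:r means roughly one latent embedding per r input tokens, and that the decoder stores one key/value pair per latent embedding. It ignores the encoder's own memory and any query or generated tokens.

```python
import math

def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Size of a standard transformer KV cache for seq_len cached positions.

    The leading factor of 2 counts one key vector and one value vector
    per position, per layer, per KV head.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_value

def compressed_length(seq_len, ratio):
    """Number of latent embeddings if every `ratio` tokens map to one latent.

    Assumption: a 1:ratio compressor emits ceil(seq_len / ratio) latents.
    """
    return math.ceil(seq_len / ratio)

# Hypothetical decoder shape, chosen for illustration only (not from the paper).
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
CONTEXT_TOKENS = 128_000

full = kv_cache_bytes(CONTEXT_TOKENS, LAYERS, KV_HEADS, HEAD_DIM)
print(f"uncompressed: {full / 2**30:.2f} GiB")

# Compression ratios named in the abstract: 1:4, 1:8, and 1:16.
for ratio in (4, 8, 16):
    latents = compressed_length(CONTEXT_TOKENS, ratio)
    compressed = kv_cache_bytes(latents, LAYERS, KV_HEADS, HEAD_DIM)
    print(f"1:{ratio:<2} -> {latents:>6} latents, {compressed / 2**30:.2f} GiB")
```

Because the cache size is linear in the number of cached positions, shortening the sequence by a factor of r shrinks the decoder-side cache by about the same factor under these assumptions.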

0 Citations

0 Influential

10 Altmetric

50.0 Score

