2606.05661v1 Jun 04, 2026 cs.AI

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Christopher Glaze

Citations: 7

h-index: 1

Matei A. Zaharia

Citations: 341

h-index: 7

Frederic Sala

Citations: 53

h-index: 4

Asim Biswal

Citations: 141

h-index: 4

R. Ramakrishnan

Citations: 3

h-index: 1

Gabriel Orlanski

New York University

Citations: 90

h-index: 5

Joseph E. Gonzalez

Citations: 436

h-index: 3

Parth Asawa

Citations: 162

h-index: 7

Ben Xu

Citations: 3

h-index: 1

V. Chen

Citations: 64

h-index: 3

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this -- in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and isolate online learning from underlying model capability, showing a need for better continual learning systems.
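The abstract names a gain metric that isolates learning from prior capability but does not define it. As a minimal sketch only, assuming gain is the mean score of a stateful agent minus the mean score of a stateless baseline on the same task sequence, it might be computed as below. The function name `learning_gain`, its parameters, and the formula are illustrative assumptions, not the paper's definition.

```python
from statistics import mean
from typing import Sequence


def learning_gain(
    stateful_scores: Sequence[float],
    stateless_scores: Sequence[float],
) -> float:
    """Hypothetical gain: mean stateful score minus mean stateless score.

    Assumption (not from the paper): both sequences hold per-task scores
    for the same tasks in the same order, so the difference reflects what
    the agent gained from carrying state across tasks rather than what
    the underlying model could already do.
    """
    if len(stateful_scores) != len(stateless_scores):
        raise ValueError("score sequences must cover the same tasks")
    if not stateful_scores:
        raise ValueError("at least one task score is required")
    return mean(stateful_scores) - mean(stateless_scores)


# Made-up scores for illustration: the stateful agent improves over the
# sequence while the stateless baseline stays flat.
example_gain = learning_gain([0.5, 0.75, 1.0], [0.5, 0.5, 0.5])
```

Under this assumed definition, a positive value would indicate that carrying state helped, and a value near zero would indicate that performance came from prior capability alone.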
