2606.05661v1 Jun 04, 2026 cs.AI

Continual Learning Bench: Evaluating Frontier AI Systems in Real-World Stateful Environments

Christopher Glaze
Citations: 7
h-index: 1
Matei A. Zaharia
Citations: 341
h-index: 7
Frederic Sala
Citations: 53
h-index: 4
Asim Biswal
Citations: 141
h-index: 4
R. Ramakrishnan
Citations: 3
h-index: 1
Gabriel Orlanski
New York University
Citations: 90
h-index: 5
Joseph E. Gonzalez
Citations: 436
h-index: 3
Parth Asawa
Citations: 162
h-index: 7
Ben Xu
Citations: 3
h-index: 1
V. Chen
Citations: 64
h-index: 3

Continual learning, the ability of AI systems to improve through sequential experience, has attracted substantial interest, but no high-quality benchmark exists to evaluate it. We introduce Continual Learning Bench (CL-Bench), the first difficult, expert-validated benchmark designed to measure whether LLM-based systems genuinely improve with experience. CL-Bench spans six diverse domains (software engineering, signal processing, disease outbreak forecasting, database querying, strategic game-playing, and demand forecasting), each validated by domain experts and designed so that tasks share a learnable latent structure (codebase layout, disease outbreak dynamics, opponent strategies) that a stateful system can discover online but a stateless one cannot. We evaluate frontier models across several agent architectures, from naive in-context learning (ICL) to dedicated memory systems, introducing a gain metric to isolate learning from prior capabilities. We find that these systems leave headroom for improved continual learning: agents frequently overfit to immediate observations or fail to reuse knowledge across instances, and dedicated memory systems do not fix this; in fact, naive ICL outperforms systems dedicated to memory management. CL-Bench is the first benchmark to evaluate continual learning across diverse real-world domains with expert-validated tasks and to isolate online learning from underlying model capability, showing a need for better continual learning systems.
