2604.08988v1 Apr 10, 2026 cs.AI

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

Jiaqing Liang

Citations: 19

h-index: 3

Yanghua Xiao

Citations: 400

h-index: 11

Tian Pan

Citations: 21

h-index: 2

Shisong Chen

Citations: 15

h-index: 2

Sihang Jiang

Citations: 359

h-index: 9

Lipeng Ma

Citations: 384

h-index: 8

Z. Hong

Citations: 122

h-index: 7

Zhiyu Lu

Citations: 6

h-index: 1

Jinghao Zhang

Citations: 7

h-index: 1

Keyi Wang

Fudan University

Citations: 6

h-index: 1

Weijia Zhou

Citations: 24

h-index: 3

Original Abstract

Current LLM-based agents perform well in episodic task execution but remain constrained by static toolsets and episodic amnesia: they fail to accumulate experience or optimize strategies across task boundaries. While the Self-Evolving Agent (SEA) paradigm has been proposed previously, this paper contributes a new formal definition of SEA grounded in digital embodiment and continuous cross-task evolution, and introduces SEA-Eval, the first benchmark designed to evaluate SEA characteristics along two dimensions: intra-task execution reliability and long-term evolutionary performance. By organizing tasks into sequential streams and analyzing Success Rate and Token Consumption over time, SEA-Eval quantifies evolutionary gain and structural stability in ways that existing episodic benchmarks cannot. Empirical evaluations reveal a significant evolutionary bottleneck in current state-of-the-art frameworks: identical success rates mask differences of up to 31.2x in token consumption, as well as divergent evolutionary trajectories under sequential analysis. SEA-Eval provides a rigorous scientific foundation for advancing agents from mere task executors toward genuinely self-evolving digital entities.
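To make the idea of tracking Success Rate and Token Consumption over a sequential task stream concrete, here is a minimal Python sketch. It is not SEA-Eval's actual metric definition: the `TaskResult` record, the non-overlapping windowing in `windowed_metrics`, the `token_ratio` helper, and all numbers are hypothetical illustrations, chosen only to show how two agents with the same success rate can differ widely in token cost.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TaskResult:
    """Outcome of one task in a sequential stream (hypothetical record)."""
    success: bool
    tokens: int


def windowed_metrics(stream: List[TaskResult], window: int) -> List[Tuple[float, float]]:
    """Return (success rate, mean tokens per task) for consecutive,
    non-overlapping windows of the stream. The windowing scheme is an
    assumption for illustration, not the benchmark's definition."""
    if window <= 0:
        raise ValueError("window must be positive")
    metrics = []
    for start in range(0, len(stream), window):
        chunk = stream[start:start + window]
        rate = sum(1 for r in chunk if r.success) / len(chunk)
        mean_tokens = sum(r.tokens for r in chunk) / len(chunk)
        metrics.append((rate, mean_tokens))
    return metrics


def success_rate(stream: List[TaskResult]) -> float:
    """Fraction of tasks in the whole stream that succeeded."""
    return sum(1 for r in stream if r.success) / len(stream)


def token_ratio(a: List[TaskResult], b: List[TaskResult]) -> float:
    """Total tokens used by stream b divided by total tokens used by stream a."""
    return sum(r.tokens for r in b) / sum(r.tokens for r in a)


# Made-up data: agent_a gets cheaper over the stream, agent_b does not.
agent_a = [TaskResult(True, 1000), TaskResult(True, 800),
           TaskResult(False, 600), TaskResult(True, 400)]
agent_b = [TaskResult(True, 4000), TaskResult(True, 4000),
           TaskResult(False, 4000), TaskResult(True, 4000)]

if __name__ == "__main__":
    print(success_rate(agent_a), success_rate(agent_b))
    print(windowed_metrics(agent_a, 2))
    print(windowed_metrics(agent_b, 2))
    print(token_ratio(agent_a, agent_b))
```

In this toy data both agents succeed on the same fraction of tasks, so an episodic success-rate comparison would not separate them; only the per-window token trend and the total-token ratio show the difference.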
