2604.17308v1 Apr 19, 2026 cs.AI

SkillFlow: Benchmarking Lifelong Skill Discovery and Evolution for Autonomous Agents

Haibo Qiu

Citations: 129

h-index: 6

Shiting Huang

Citations: 52

h-index: 5

Yu Zeng

Citations: 172

h-index: 8

Qingnan Ren

Citations: 5

h-index: 1

Qisheng Su

Citations: 41

h-index: 3

Lin Chen

Citations: 2,336

h-index: 8

Zehui Chen

Citations: 1,770

h-index: 9

Wenxuan Huang

Citations: 57

h-index: 6

Yiming Zhao

Citations: 60

h-index: 5

Ziao Zhang

Citations: 225

h-index: 8

Kou Shi

Citations: 0

h-index: 0

Avery Nie

Citations: 0

h-index: 0

Wei Yang

Citations: 10

h-index: 1

Shu Zou

Citations: 106

h-index: 5

Feng Zhao

Citations: 2,050

h-index: 6

Zhenlong Fang

Citations: 36

h-index: 3

Abstract

As the capability frontier of autonomous agents continues to expand, they are increasingly able to complete specialized tasks through plug-and-play external skills. Yet current benchmarks mostly test whether models can use provided skills, leaving open whether they can discover skills from experience, repair them after failure, and maintain a coherent library over time. We introduce SkillFlow, a benchmark of 166 tasks across 20 families in which task construction within each family follows a Domain-Agnostic Execution Flow (DAEF) that defines an agent workflow framework, allowing these tasks to share a consistent workflow. Agents are evaluated under an Agentic Lifelong Learning protocol in which they begin without skills, solve tasks sequentially within each family, externalize lessons through trajectory- and rubric-driven skill patches, and carry the updated library forward. Experiments reveal a substantial capability gap. For Claude Opus 4.6, lifelong skill evolution improves task success from 62.65% to 71.08% (+8.43 points). However, high skill usage does not necessarily imply high utility: Kimi K2.5 gains only +0.60 points despite 66.87% skill usage, while Qwen-Coder-Next reaches only a 44.58% task completion rate and still regresses relative to the vanilla setting. SkillFlow contributes a structured testbed for this direction and an in-depth empirical analysis of skill discovery, patching, transfer, and their failure modes under lifelong evaluation.
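The Agentic Lifelong Learning protocol described in the abstract can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the benchmark's actual implementation: the names `Attempt`, `SkillLibrary`, `run_family`, `toy_solve`, and `toy_patch` are hypothetical, a skill is modeled as a plain string, and the library is assumed to start empty for each task family (the abstract does not say whether libraries are shared across families).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Attempt:
    """Result of one task attempt (hypothetical structure)."""
    success: bool
    trajectory: str        # what the agent did
    rubric_feedback: str   # how the attempt scored against the rubric


@dataclass
class SkillLibrary:
    """Hypothetical skill store: skill name -> skill text."""
    skills: Dict[str, str] = field(default_factory=dict)

    def apply_patch(self, patch: Dict[str, str]) -> None:
        # A patch adds new skills or overwrites (repairs) existing ones.
        self.skills.update(patch)


# Hypothetical callback signatures; the real agent interfaces are not specified.
SolveFn = Callable[[str, SkillLibrary], Attempt]
PatchFn = Callable[[str, Attempt, SkillLibrary], Dict[str, str]]


def run_family(
    tasks: List[str], solve: SolveFn, propose_patch: PatchFn
) -> Tuple[List[bool], SkillLibrary]:
    library = SkillLibrary()       # assumption: the agent begins without skills
    outcomes: List[bool] = []
    for task in tasks:             # tasks in a family are solved sequentially
        attempt = solve(task, library)
        outcomes.append(attempt.success)
        # Externalize lessons from the trajectory and rubric as a skill patch,
        # then carry the updated library forward to the next task.
        library.apply_patch(propose_patch(task, attempt, library))
    return outcomes, library


# Toy stand-ins for an agent, used only to exercise the loop.
def toy_solve(task: str, library: SkillLibrary) -> Attempt:
    solved = len(library.skills) > 0   # toy rule: needs at least one skill
    return Attempt(solved, f"trace of {task}", "pass" if solved else "fail")


def toy_patch(task: str, attempt: Attempt, library: SkillLibrary) -> Dict[str, str]:
    return {f"lesson_from_{task}": attempt.trajectory}


outcomes, library = run_family(["t1", "t2", "t3"], toy_solve, toy_patch)
```

In this toy run the first task fails because the library is empty, and later tasks succeed once a lesson has been stored. That mirrors the intended effect of skill evolution but says nothing about real agent behavior. As a sanity check on the reported numbers, 104/166 ≈ 62.65% and 118/166 ≈ 71.08%, so the +8.43 point gain for Claude Opus 4.6 is consistent with 14 additional solved tasks if both rates are computed over all 166 tasks.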

