2602.12670v1 Feb 13, 2026 cs.AI

SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks

Steven Dillmann. Citations: 391; h-index: 6
Wengao Ye (University of Oxford). Citations: 43; h-index: 2
Yubo Li. Citations: 119; h-index: 4
Xiangyi Li. Citations: 39; h-index: 1
Wenbo Chen. Citations: 40; h-index: 1
Yiming Liu. Citations: 68; h-index: 4
Shenghan Zheng. Citations: 39; h-index: 1
Yifeng He (University of California, Davis). Citations: 165; h-index: 7
B. You. Citations: 105; h-index: 5
Shuyi Wang. Citations: 56; h-index: 3
Di Wang. Citations: 100; h-index: 4
Xuandong Zhao (UC Berkeley). Citations: 3,118; h-index: 28
Roey Ben Chaim. Citations: 38; h-index: 1
Zonglin Di. Citations: 90; h-index: 4
Yi Gao. Citations: 3,976; h-index: 4
Junwei He. Citations: 61; h-index: 2
Yizhuo He. Citations: 41; h-index: 2
Liqiang Jing. Citations: 70; h-index: 4
Luyang Kong. Citations: 70; h-index: 4
Xin Lan. Citations: 109; h-index: 3
Jiachen Li. Citations: 48; h-index: 2
Songlin Li. Citations: 45; h-index: 2
Yijiang Li. Citations: 115; h-index: 6
Yue Lin. Citations: 44; h-index: 2
Xinyi Liu. Citations: 38; h-index: 1
Xuanqing Liu. Citations: 47; h-index: 2
H. Lyu. Citations: 58; h-index: 3
Zexiong Ma. Citations: 39; h-index: 1
Bowei Wang. Citations: 95; h-index: 4
Runhui Wang. Citations: 59; h-index: 3
Tianyu Wang. Citations: 131; h-index: 4
Yue Zhang. Citations: 40; h-index: 2
Hanwen Xing. Citations: 47; h-index: 2
Han Lee. Citations: 41; h-index: 2
Haotian Shen. Citations: 76; h-index: 4
Qunhong Zeng. Citations: 62; h-index: 4
Yuanli Wang. Citations: 42; h-index: 2
Y. Xue. Citations: 57; h-index: 3
Xiaokun Chen. Citations: 100; h-index: 3
Jiankai Sun. Citations: 3,871; h-index: 22

Abstract

Agent Skills are structured packages of procedural knowledge that augment LLM agents at inference time. Despite rapid adoption, there is no standard way to measure whether they actually help. We present SkillsBench, a benchmark of 86 tasks across 11 domains paired with curated Skills and deterministic verifiers. Each task is evaluated under three conditions: no Skills, curated Skills, and self-generated Skills. We test 7 agent-model configurations over 7,308 trajectories. Curated Skills raise the average pass rate by 16.2 percentage points (pp), but effects vary widely by domain (from +4.5pp for Software Engineering to +51.9pp for Healthcare), and 16 of 84 tasks show negative deltas. Self-generated Skills provide no benefit on average, showing that models cannot reliably author the procedural knowledge they benefit from consuming. Focused Skills with 2-3 modules outperform comprehensive documentation, and smaller models with Skills can match larger models without them.
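The abstract reports results as pass-rate changes in percentage points between conditions. The sketch below shows one way such deltas could be computed from pass/fail trajectories. It is a minimal illustration, not the authors' evaluation code. The record layout, the condition labels (`no_skills`, `curated`, `self_generated`), the function names, and every data value are hypothetical, and averaging per-task deltas is an assumption about the aggregation.

```python
from collections import defaultdict

# Hypothetical trajectory records: (task_id, condition, passed).
# These values are invented for illustration; they are NOT SkillsBench data.
TRAJECTORIES = [
    ("task_a", "no_skills", False),
    ("task_a", "curated", True),
    ("task_a", "self_generated", False),
    ("task_b", "no_skills", True),
    ("task_b", "curated", True),
    ("task_b", "self_generated", True),
    ("task_c", "no_skills", True),
    ("task_c", "curated", False),
    ("task_c", "self_generated", False),
    ("task_d", "no_skills", False),
    ("task_d", "curated", True),
    ("task_d", "self_generated", True),
]


def pass_rates(trajectories):
    """Return {task_id: {condition: pass rate in [0, 1]}}."""
    counts = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passes, runs]
    for task, condition, passed in trajectories:
        counts[task][condition][0] += int(passed)
        counts[task][condition][1] += 1
    return {
        task: {cond: passes / runs for cond, (passes, runs) in conds.items()}
        for task, conds in counts.items()
    }


def delta_pp(rates, condition, baseline="no_skills"):
    """Per-task pass-rate change versus the baseline, in percentage points."""
    return {
        task: 100.0 * (by_cond[condition] - by_cond[baseline])
        for task, by_cond in rates.items()
    }


def mean(values):
    values = list(values)
    return sum(values) / len(values)


rates = pass_rates(TRAJECTORIES)
curated = delta_pp(rates, "curated")
self_gen = delta_pp(rates, "self_generated")

# Assumption: the headline number is the mean of per-task deltas.
mean_curated = mean(curated.values())
mean_self_gen = mean(self_gen.values())

# Tasks where the curated Skill made things worse (a negative delta).
negative_tasks = sorted(task for task, d in curated.items() if d < 0)

print(f"curated: {mean_curated:+.1f}pp, self-generated: {mean_self_gen:+.1f}pp")
print("negative deltas:", negative_tasks)
```

A percentage-point delta is a difference of rates rather than a relative change, so a task moving from a 50% to a 60% pass rate counts as +10pp. Reporting per-task deltas alongside the mean is what exposes tasks where a Skill hurts, which a single average would hide.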
