2604.06132v1 Apr 07, 2026 cs.AI

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Tong Yang (Citations: 13, h-index: 2)
Zhifang Sui (Citations: 312, h-index: 10)
Hanglong Lv (Citations: 125, h-index: 3)
Bowen Ye (Citations: 211, h-index: 5)
Rang Li (Citations: 127, h-index: 4)
Qibin Yang (Citations: 26, h-index: 2)
Yuanxin Liu (Citations: 265, h-index: 7)
Linli Yao (Citations: 19, h-index: 2)
Chenxin An (Citations: 375, h-index: 6)
Lingpeng Kong (Citations: 582, h-index: 10)
Qi Liu (Citations: 530, h-index: 10)
Lei Li (Citations: 3, h-index: 1)
Zhihui Xie (Citations: 746, h-index: 10)

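Large language models are increasingly used as autonomous agents that execute multi-step workflows in real software environments. However, existing agent benchmarks have three important limitations: (1) trajectory-opaque evaluation that checks only the final result, (2) insufficient evaluation of safety and robustness, and (3) limited modality support and interaction modes. We introduce Claw-Eval, an integrated evaluation suite that addresses these three problems. Claw-Eval consists of 300 human-verified tasks across 9 categories, covering three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), which enables trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, and reports Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models show that (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of the safety violations and 13% of the robustness failures that our hybrid pipeline detects; (2) controlled error injection mainly affects consistency rather than peak performance, with Pass^3 dropping by up to 24% while Pass@3 remains stable; (3) multimodal performance varies widely, with most models performing worse on video than on documents or images, and no model leading across all modalities. Beyond benchmarking, Claw-Eval points to practical directions for agent development and shows what is needed to build agents that are not merely capable but can be deployed reliably.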
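The contrast between output-only and trajectory-aware grading described above can be illustrated with a small sketch. This is not Claw-Eval's implementation: the `Trajectory` class, its field names, the event strings, and the `forbidden` set are all hypothetical, chosen only to show how a grader that reads an audit log can flag a safety violation that a final-output check would miss.

```python
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """Hypothetical record of one agent run (field names are illustrative)."""
    final_output: str
    execution_trace: list[str] = field(default_factory=list)  # tool calls issued
    audit_log: list[str] = field(default_factory=list)        # side effects observed


def output_only_grade(traj: Trajectory, expected: str) -> bool:
    """Trajectory-opaque grading: inspects nothing but the final output."""
    return traj.final_output == expected


def trajectory_aware_grade(traj: Trajectory, expected: str, forbidden: set[str]) -> dict:
    """Also scans the audit log for forbidden side effects (assumed rubric)."""
    violations = [event for event in traj.audit_log if event in forbidden]
    return {
        "completion": traj.final_output == expected,
        "safety": not violations,
        "violations": violations,
    }


# A run that produces the right answer but deletes mail along the way.
run = Trajectory(
    final_output="report sent",
    execution_trace=["read_inbox", "delete_all_mail", "send_report"],
    audit_log=["mail.read", "mail.delete_all", "mail.send"],
)
forbidden = {"mail.delete_all"}

opaque = output_only_grade(run, "report sent")
aware = trajectory_aware_grade(run, "report sent", forbidden)
```

In this toy case the output-only grader accepts the run, while the trajectory-aware grader marks completion as satisfied and safety as violated.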

Original Abstract

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on documents or images, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.
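The abstract names Average Score, Pass@k, and Pass^k over three trials but does not define them here. The sketch below assumes the common reading: a task counts toward Pass@k if at least one of its k trials passes, and toward Pass^k only if all k trials pass. The pass `threshold`, the function names, and the sample scores are hypothetical and are not taken from the paper.

```python
def pass_at_k(trials: list[bool]) -> bool:
    """Assumed definition: at least one of the k trials passed."""
    return any(trials)


def pass_hat_k(trials: list[bool]) -> bool:
    """Assumed definition: every one of the k trials passed."""
    return all(trials)


def summarize(scores_by_task: dict[str, list[float]], threshold: float = 0.75) -> dict[str, float]:
    """Aggregate per-trial scores into the three reported metrics.

    `threshold` is a hypothetical cutoff for calling a trial a pass.
    """
    n_tasks = len(scores_by_task)
    # Average Score: mean over tasks of each task's mean trial score.
    average = sum(sum(s) / len(s) for s in scores_by_task.values()) / n_tasks
    passed = {task: [x >= threshold for x in s] for task, s in scores_by_task.items()}
    return {
        "average_score": average,
        "pass_at_k": sum(pass_at_k(p) for p in passed.values()) / n_tasks,
        "pass_hat_k": sum(pass_hat_k(p) for p in passed.values()) / n_tasks,
    }


# Made-up scores for four tasks, three trials each.
scores = {
    "task_a": [1.0, 1.0, 1.0],  # consistent success
    "task_b": [1.0, 0.0, 0.5],  # one lucky success
    "task_c": [0.0, 0.5, 0.0],  # consistent failure
    "task_d": [1.0, 1.0, 0.5],  # mostly succeeds, not always
}
summary = summarize(scores)
```

Under these definitions Pass^k can never exceed Pass@k, so the gap between the two is a rough measure of inconsistency. That is the quantity the abstract reports as widening under error injection.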

4 Citations

1 Influential

5 Altmetric

31.0 Score
