2605.02661v1 May 04, 2026 cs.AI

AcademiClaw: When Students Set Challenges for AI Agents

Yujia Liu (Citations: 0, h-index: 0)
Weiye Si (Citations: 37, h-index: 3)
Pengrui Lu (Citations: 280, h-index: 5)
Ling Yang (Citations: 538, h-index: 8)
Yukun Li (Citations: 63, h-index: 3)
Xuan Yang (Citations: 272, h-index: 2)
Yue Wu (Citations: 295, h-index: 6)
Jiaxing Song (Citations: 315, h-index: 3)
Yichi Zhang (Citations: 54, h-index: 2)
Borui Zhang (Citations: 34, h-index: 3)
Yi Yang (Citations: 45, h-index: 3)
Yuchen Sun (Citations: 27, h-index: 3)
Y. Wang (Citations: 518, h-index: 7)
Junjie Yu (Citations: 25, h-index: 3)
Jiabao Wu (Citations: 18, h-index: 1)
Kaiwen Tao (Citations: 0, h-index: 0)
Kun Wang (Citations: 645, h-index: 11)
Xiuting Guo (Citations: 15, h-index: 2)
Yanjie Wang (Citations: 1, h-index: 1)
Zijian Hu (Citations: 3, h-index: 1)
Bin Qiang (Citations: 26, h-index: 3)
Chenning Li (Citations: 66, h-index: 4)
Enchang Zhang (Citations: 7, h-index: 2)
Fengjiao Jian (Citations: 2, h-index: 1)
Hongyu Liu (Citations: 0, h-index: 0)
Jia Deng (Citations: 296, h-index: 3)
Jiaying Chi (Citations: 0, h-index: 0)
Jiayou Shi (Citations: 20, h-index: 3)
Jinghui Zhong (Citations: 7, h-index: 2)
Jingyu Zhou (Citations: 7, h-index: 2)
Jinze Li (Citations: 24, h-index: 3)
Jun Yi (Citations: 2, h-index: 1)
Jun-li Yu (Citations: 0, h-index: 0)
Jun Xue (Citations: 0, h-index: 0)
Nichelle Song (Citations: 0, h-index: 0)
Pengyi Chen (Citations: 20, h-index: 3)
Qi Chen (Citations: 8, h-index: 2)
Rui Tao (Citations: 47, h-index: 2)
Sheng Gong (Citations: 1, h-index: 1)
Shen Lu (Citations: 4, h-index: 1)
Tianqi Shen (Citations: 14, h-index: 1)
Tianxiang Zhu (Citations: 10, h-index: 1)
Tie-sheng Kang (Citations: 0, h-index: 0)
Tingyu Li (Citations: 7, h-index: 2)
Wen-Zheng Wu (Citations: 0, h-index: 0)
Xiao Zhou (Citations: 678, h-index: 9)
Xiaotao Zhang (Citations: 10, h-index: 2)
Xun Zhang (Citations: 0, h-index: 0)
Yan Li (Citations: 25, h-index: 3)
Ye Lu (Citations: 14, h-index: 1)
Yibo Zhou (Citations: 15, h-index: 2)
Yihao Sun (Citations: 43, h-index: 4)
Yijun Huang (Citations: 152, h-index: 8)
Yu-Hsiang Sun (Citations: 2, h-index: 1)
Yu-Hsuan Tu (Citations: 4, h-index: 1)
Yuxuan Qin (Citations: 5, h-index: 1)
Zeyu Li (Citations: 6, h-index: 2)
Zhengyu Lou (Citations: 11, h-index: 2)
Zhenning Ran (Citations: 0, h-index: 0)
Zizhu He (Citations: 1, h-index: 1)
Qiran Zhang (Citations: 4, h-index: 1)
Xuan Wang (Citations: 9, h-index: 2)
Yang Wang (Citations: 12, h-index: 2)
Ziyi Yang (Citations: 136, h-index: 3)
Zonghang Zhou (Citations: 21, h-index: 2)
Feifan Chen (Citations: 3, h-index: 1)
Feng Sun (Citations: 48, h-index: 4)
Haoyang Zheng (Citations: 0, h-index: 0)
Haoran Zhu (Citations: 16, h-index: 3)
J. Fang (Citations: 8, h-index: 2)
Quansheng Li (Citations: 11, h-index: 2)
Xiaoyou Shen (Citations: 1, h-index: 1)
Xinrong Li (Citations: 2, h-index: 1)
Yixin Zhu (Citations: 19, h-index: 3)
Yixuan Wu (Citations: 2, h-index: 1)
Yuzhuo Wu (Citations: 6, h-index: 1)
Pengfei Liu (Citations: 528, h-index: 7)

Original Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows -- homework, research projects, competitions, and personal projects -- that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.

0 Citations

0 Influential

39.666066720281 Altmetric

198.3 Score
