2605.02661v1 May 04, 2026 cs.AI

AcademiClaw: When Students Set Challenges for AI Agents

Yujia Liu, Weiye Si, Pengrui Lu, Ling Yang, Yukun Li, Xuan Yang, Yue Wu, Jiaxing Song, Yichi Zhang, Borui Zhang, Yi Yang, Yuchen Sun, Y. Wang, Junjie Yu, Jiabao Wu, Kaiwen Tao, Kun Wang, Xiuting Guo, Yanjie Wang, Zijian Hu, Bin Qiang, Chenning Li, Enchang Zhang, Fengjiao Jian, Hongyu Liu, Jia Deng, Jiaying Chi, Jiayou Shi, Jinghui Zhong, Jingyu Zhou, Jinze Li, Jun Yi, Jun-li Yu, Jun Xue, Nichelle Song, Pengyi Chen, Qi Chen, Rui Tao, Sheng Gong, Shen Lu, Tianqi Shen, Tianxiang Zhu, Tie-sheng Kang, Tingyu Li, Wen-Zheng Wu, Xiao Zhou, Xiaotao Zhang, Xun Zhang, Yan Li, Ye Lu, Yibo Zhou, Yihao Sun, Yijun Huang, Yu-Hsiang Sun, Yu-Hsuan Tu, Yuxuan Qin, Zeyu Li, Zhengyu Lou, Zhenning Ran, Zizhu He, Qiran Zhang, Xuan Wang, Yang Wang, Ziyi Yang, Zonghang Zhou, Feifan Chen, Feng Sun, Haoyang Zheng, Haoran Zhu, J. Fang, Quansheng Li, Xiaoyou Shen, Xinrong Li, Yixin Zhu, Yixuan Wu, Yuzhuo Wu, Pengfei Liu

Original Abstract

Benchmarks within the OpenClaw ecosystem have thus far evaluated exclusively assistant-level tasks, leaving the academic-level capabilities of OpenClaw largely unexamined. We introduce AcademiClaw, a bilingual benchmark of 80 complex, long-horizon tasks sourced directly from university students' real academic workflows (homework, research projects, competitions, and personal projects) that they found current AI agents unable to solve effectively. Curated from 230 student-submitted candidates through rigorous expert review, the final task set spans 25+ professional domains, ranging from olympiad-level mathematics and linguistics problems to GPU-intensive reinforcement learning and full-stack system debugging, with 16 tasks requiring CUDA GPU execution. Each task executes in an isolated Docker sandbox and is scored on task completion by multi-dimensional rubrics combining six complementary techniques, with an independent five-category safety audit providing additional behavioral analysis. Experiments on six frontier models show that even the best achieves only a 55% pass rate. Further analysis uncovers sharp capability boundaries across task domains, divergent behavioral strategies among models, and a disconnect between token consumption and output quality, providing fine-grained diagnostic signals beyond what aggregate metrics reveal. We hope that AcademiClaw and its open-sourced data and code can serve as a useful resource for the OpenClaw community, driving progress toward agents that are more capable and versatile across the full breadth of real-world academic demands. All data and code are available at https://github.com/GAIR-NLP/AcademiClaw.
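
To make the scoring description concrete, the following is a minimal sketch of how per-task rubric scores might be aggregated into a benchmark-level pass rate. It is not the AcademiClaw implementation: the abstract does not specify the rubric dimensions, their weights, or the pass criterion, so `RubricItem`, `task_score`, `pass_rate`, the weighted-mean aggregation, and the `0.6` threshold are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    name: str      # hypothetical rubric dimension, e.g. "correctness"
    weight: float  # hypothetical relative weight of this dimension
    score: float   # score in [0, 1] produced by some checker


def task_score(items: list[RubricItem]) -> float:
    """Weighted mean of rubric scores (assumed aggregation rule)."""
    total_weight = sum(item.weight for item in items)
    if total_weight == 0:
        return 0.0
    return sum(item.weight * item.score for item in items) / total_weight


def pass_rate(scores: list[float], threshold: float = 0.6) -> float:
    """Fraction of tasks whose score meets an assumed pass threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)


# Illustrative data only: one task scored on three made-up dimensions.
example_task = [
    RubricItem("correctness", weight=2.0, score=1.0),
    RubricItem("completeness", weight=1.0, score=0.5),
    RubricItem("code_quality", weight=1.0, score=0.0),
]
example_score = task_score(example_task)

# If the pass rate were a plain fraction over the 80 tasks, 55% would
# correspond to 44 passing tasks (an assumption about how it is computed).
example_rate = pass_rate([1.0] * 44 + [0.0] * 36)
```

Under these assumptions, a single aggregate such as `example_rate` hides which rubric dimensions a model failed on, which is the kind of detail the abstract says the finer-grained analysis is meant to expose.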
