2605.14322v1 May 14, 2026 cs.AI

Are Agents Ready to Teach? A Multi-Stage Benchmark for Real-World Teaching Workflows

Yujia Liu

Xiaodong Deng

Dayiheng Liu

Rui Sheng

Zixin Chen

Pengwei Liu

Haobo Li

Kashun Shum

Huamin Qu

Original Abstract

Language agents are increasingly deployed in complex professional workflows, with tutoring emerging as a particularly high-stakes capability that remains largely unmeasured in existing benchmarks. Effective tutor agents require more than producing correct answers or executing accurate tool calls: a robust tutor must diagnose learner state, adapt support over time, make pedagogically justified decisions grounded in educational evidence, and execute interventions within realistic learning-management systems. We introduce EduAgentBench, a source-grounded benchmark for holistically evaluating tutor agents across the full scope of teaching work. It contains 150 quality-controlled tasks across three capability surfaces: professional pedagogical judgment, situated multi-turn tutoring, and Canvas-style teaching workflow completion. Tasks are constructed through a pedagogical-insight-driven pipeline and evaluated with complementary verification signals and human review. Across a comprehensive evaluation of frontier models, our findings reveal that current models are generally capable of bounded pedagogical judgment, but still fall short of professional teaching standards in situated tutoring and autonomous teaching-workflow execution. To our knowledge, EduAgentBench is the first theory-grounded and realistic benchmark for evaluating the holistic teaching capability of tutor agents, providing a measurement foundation for developing future tutor agents that can support realistic teaching work.
