2601.21375v1 Jan 29, 2026 cs.AI

TeachBench: A Syllabus-Grounded Framework for Evaluating Teaching Ability in Large Language Models

Zhifang Sui (Citations: 407; h-index: 11)
Zheng Li, Peking University (Citations: 80; h-index: 3)
Siyao Song (Citations: 0; h-index: 0)
Jingyuan Ma (Citations: 149; h-index: 5)
Ying Zeng (Citations: 29; h-index: 2)
Minghao Li (Citations: 37; h-index: 3)
Rui Li (Citations: 85; h-index: 3)

Original Abstract

Large language models (LLMs) show promise as teaching assistants, yet their teaching capability remains insufficiently evaluated. Existing benchmarks mainly focus on problem-solving or problem-level guidance, leaving knowledge-centered teaching underexplored. We propose a syllabus-grounded evaluation framework that measures LLM teaching capability via student performance improvement after multi-turn instruction. By restricting teacher agents to structured knowledge points and example problems, the framework avoids information leakage and enables reuse of existing benchmarks. We instantiate the framework on Gaokao data across multiple subjects. Experiments reveal substantial variation in teaching effectiveness across models and domains: some models perform well in mathematics, while teaching remains challenging in physics and chemistry. We also find that incorporating example problems does not necessarily improve teaching, as models often shift toward example-specific error correction. Overall, our results highlight teaching ability as a distinct and measurable dimension of LLM behavior.
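The abstract describes measuring teaching capability through student performance improvement after multi-turn instruction, but it does not give the exact formula. The sketch below is one plausible reading, assuming the gain is post-instruction accuracy minus pre-instruction accuracy on the same test items, averaged over students. The names `StudentResult`, `teaching_gain`, and `mean_teaching_gain` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StudentResult:
    """Per-item correctness for one simulated student (hypothetical structure)."""
    pre: Sequence[bool]   # correctness on test items before instruction
    post: Sequence[bool]  # correctness on the same items after instruction


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of items answered correctly."""
    if not flags:
        raise ValueError("accuracy is undefined for an empty item list")
    return sum(flags) / len(flags)


def teaching_gain(result: StudentResult) -> float:
    """Assumed metric: post-instruction accuracy minus pre-instruction accuracy."""
    return accuracy(result.post) - accuracy(result.pre)


def mean_teaching_gain(results: Sequence[StudentResult]) -> float:
    """Average gain over all students taught by one teacher model."""
    if not results:
        raise ValueError("at least one student result is required")
    return sum(teaching_gain(r) for r in results) / len(results)


if __name__ == "__main__":
    # Illustrative data only; not results from the paper.
    improved = StudentResult(pre=[True, False, False, False],
                             post=[True, True, True, False])
    unchanged = StudentResult(pre=[True, False], post=[False, True])
    print(teaching_gain(improved))
    print(mean_teaching_gain([improved, unchanged]))
```

Under this assumption, a negative gain would indicate that instruction left the student worse off. That makes the metric sensitive to the example-specific error correction the abstract mentions.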
