2602.18466v1 Feb 08, 2026 cs.CY

Can Multimodal LLMs See Science Instruction? Benchmarking Pedagogical Reasoning in K-12 Classroom Videos

Peng He

Yixuan Shen

Honglu Liu

Tingting Li

Kaidi Xu

Feng Liu

Yuyang Ji

Tianlong Chen

Original Abstract

K-12 science classrooms are rich sites of inquiry where students coordinate phenomena, evidence, and explanatory models through discourse; yet, the multimodal complexity of these interactions has made automated analysis elusive. Existing benchmarks for classroom discourse focus primarily on mathematics and rely solely on transcripts, overlooking the visual artifacts and model-based reasoning emphasized by the Next Generation Science Standards (NGSS). We address this gap with SciIBI, the first video benchmark for analyzing science classroom discourse, featuring 113 NGSS-aligned clips annotated with Core Instructional Practices (CIP) and sophistication levels. By evaluating eight state-of-the-art LLMs and Multimodal LLMs, we reveal fundamental limitations: current models struggle to distinguish pedagogically similar practices, suggesting that CIP coding requires instructional reasoning beyond surface pattern matching. Furthermore, adding video input yields inconsistent gains across architectures. Crucially, our evidence-based evaluation reveals that models often succeed through surface shortcuts rather than genuine pedagogical understanding. These findings establish science classroom discourse as a challenging frontier for multimodal AI and point toward human-AI collaboration, where models retrieve evidence to accelerate expert review rather than replace it.
