2604.04482v1 Apr 06, 2026 cs.AI

Scalable and Explainable Learner-Video Interaction Prediction using Multimodal Large Language Models

Fares Fawzi (Citations: 23, h-index: 3)

Tanja Kaser (Citations: 168, h-index: 7)

Dominik Glandorf (Citations: 36, h-index: 3)

Abstract

Learners' use of video controls in educational videos provides implicit signals of cognitive processing and instructional design quality, yet the lack of scalable and explainable predictive models limits instructors' ability to anticipate such behavior before deployment. We propose a scalable, interpretable pipeline for predicting population-level watching, pausing, skipping, and rewinding behavior as proxies for cognitive load from video content alone. Our approach leverages multimodal large language models (MLLMs) to compute embeddings of short video segments and trains a neural classifier to identify temporally fine-grained interaction peaks. Drawing from multimedia learning theory on instructional design for optimal cognitive load, we code features of the video segments using GPT-5 and employ them as a basis for interpreting model predictions via concept activation vectors. We evaluate our pipeline on 77 million video control events from 66 online courses. Our findings demonstrate that classifiers based on MLLM embeddings reliably predict interaction peaks, generalize to unseen academic fields, and encode interpretable, theory-relevant instructional concepts. Overall, our results show the feasibility of cost-efficient, interpretable pre-screening of educational video design and open new opportunities to empirically examine multimedia learning theory at scale.
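
The abstract names concept activation vectors (CAVs) as the interpretation tool but does not say how they are computed. The following is a minimal sketch of the general idea, under stated assumptions: a linear probe (plain logistic regression) separates embeddings of segments that show a coded concept from those that do not, and the normalized probe weights serve as the concept direction. The function names, the toy 4-dimensional vectors standing in for MLLM embeddings, and the projection-based score are all hypothetical and are not taken from the paper.

```python
import math
import random


def train_linear_probe(pos, neg, epochs=200, lr=0.1):
    """Fit logistic regression separating concept-positive from
    concept-negative embeddings with plain stochastic gradient descent."""
    dim = len(pos[0])
    w = [0.0] * dim
    b = 0.0
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the log loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def concept_activation_vector(pos, neg):
    """Return the unit-length normal of the probe's decision boundary."""
    w, _ = train_linear_probe(pos, neg)
    norm = math.sqrt(sum(wi * wi for wi in w))
    return [wi / norm for wi in w]


def concept_score(embedding, cav):
    """Project an embedding onto the concept direction (assumed score)."""
    return sum(e * c for e, c in zip(embedding, cav))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy data: the hypothetical concept shifts the first dimension only.
    pos = [[1.0 + rng.gauss(0, 0.1)] + [rng.gauss(0, 0.1) for _ in range(3)]
           for _ in range(20)]
    neg = [[-1.0 + rng.gauss(0, 0.1)] + [rng.gauss(0, 0.1) for _ in range(3)]
           for _ in range(20)]
    cav = concept_activation_vector(pos, neg)
    print("concept direction:", [round(c, 3) for c in cav])
    print("score of a positive segment:", round(concept_score(pos[0], cav), 3))
    print("score of a negative segment:", round(concept_score(neg[0], cav), 3))
```

The original CAV formulation measures how a classifier's output changes along the concept direction; the projection used here is a simpler stand-in that only shows how strongly a single embedding expresses the concept.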

