2604.05005v1 Apr 06, 2026 cs.CY

EduIllustrate: A Study on the Scalable Automated Generation of Multimodal Educational Content

EduIllustrate: Towards Scalable Automated Generation Of Multimodal Educational Content

Keqian Li

Citations: 29,166

h-index: 8

Aimin Zhou

Citations: 1

h-index: 1

Shuzhen Bi

Citations: 0

h-index: 0

Mingzi Zhang

Citations: 14

h-index: 1

Zhuoxuan Li

Citations: 0

h-index: 0

Xiaolong Wang

Citations: 39

h-index: 3

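Large language models (LLMs) are increasingly used as educational support tools, but evaluation of their educational capabilities has focused mainly on question-answering and tutoring tasks. This work points out a critical gap in multimedia instructional content generation: the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. It presents EduIllustrate, a benchmark for evaluating how well LLMs generate interleaved text-diagram explanations for K-12 STEM problems. The benchmark consists of 230 problems across five subjects and three grade levels, a standardized generation protocol that uses sequential anchoring to keep visuals consistent across diagrams, and an eight-dimension evaluation rubric grounded in multimedia learning theory that covers both text and visual quality. An evaluation of ten LLMs shows a wide performance spread: Gemini 3.0 Pro Preview scores highest at 87.8%, while Kimi-K2.5 is the most cost-efficient, reaching 80.8% at $0.12 per problem. A workflow ablation shows that sequential anchoring improves Visual Consistency by 13% while cutting cost by 94%. A human evaluation with 20 expert raters validates the reliability of LLM-as-judge scoring on objective dimensions (ρ ≥ 0.83) but reveals limitations on subjective visual assessment.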

Original Abstract

Large language models are increasingly used as educational assistants, yet evaluation of their educational capabilities remains concentrated on question-answering and tutoring tasks. A critical gap exists for multimedia instructional content generation -- the ability to produce coherent, diagram-rich explanations that combine geometrically accurate visuals with step-by-step reasoning. We present EduIllustrate, a benchmark for evaluating LLMs on interleaved text-diagram explanation generation for K-12 STEM problems. The benchmark comprises 230 problems spanning five subjects and three grade levels, a standardized generation protocol with sequential anchoring to enforce cross-diagram visual consistency, and an 8-dimension evaluation rubric grounded in multimedia learning theory covering both text and visual quality. Evaluation of ten LLMs reveals a wide performance spread: Gemini 3.0 Pro Preview leads at 87.8%, while Kimi-K2.5 achieves the best cost-efficiency (80.8% at $0.12/problem). Workflow ablation confirms sequential anchoring improves Visual Consistency by 13% at 94% lower cost. Human evaluation with 20 expert raters validates LLM-as-judge reliability for objective dimensions (ρ ≥ 0.83) while revealing limitations on subjective visual assessment.

0 Citations

0 Influential

4 Altmetric

20.0 Score

