2602.15034v1 Jan 22, 2026 cs.CL

EduResearchBench: A Hierarchical Atomic Task Decomposition Benchmark for Full-Lifecycle Educational Research

Aimin Zhou (Citations: 0; h-index: 0)
Bo Jiang (Citations: 6; h-index: 1)
Bingdong Li (Citations: 0; h-index: 0)
Hao Hao (Citations: 1; h-index: 1)
Houping Yue (Citations: 1; h-index: 1)
Zixiang Di (Citations: 25; h-index: 2)
Mei Jiang (Citations: 15; h-index: 1)
Yu Song (Citations: 113; h-index: 3)

While Large Language Models (LLMs) are reshaping the paradigm of AI for Social Science (AI4SS), rigorously evaluating their capabilities in scholarly writing remains a major challenge. Existing benchmarks largely emphasize single-shot, monolithic generation and thus lack the fine-grained assessments required to reflect complex academic research workflows. To fill this gap, we introduce EduResearchBench, the first comprehensive evaluation platform dedicated to educational academic writing. EduResearchBench is built upon our Hierarchical Atomic Task Decomposition (HATD) framework, which decomposes an end-to-end research workflow into six specialized research modules (e.g., Quantitative Analysis, Qualitative Research, and Policy Research) spanning 24 fine-grained atomic tasks. This taxonomy enables an automated evaluation pipeline that mitigates a key limitation of holistic scoring, where aggregate scores often obscure specific capability bottlenecks, and instead provides fine-grained, diagnostic feedback on concrete deficiencies. Moreover, recognizing the high cognitive load inherent in scholarly writing, we propose a curriculum learning strategy that progressively builds competence from foundational skills to complex methodological reasoning and argumentation. Leveraging 55K raw academic samples, we curate 11K high-quality instruction pairs to train EduWrite, a specialized educational scholarly writing model. Experiments show that EduWrite (30B) substantially outperforms larger general-purpose models (72B) on multiple core metrics, demonstrating that in vertical domains, data quality density and hierarchically staged training curricula are more decisive than parameter scale.
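The contrast the abstract draws between holistic scoring and task-level diagnosis can be illustrated with a minimal sketch. It assumes each atomic task yields a score in [0, 1]. The three module names are taken from the abstract. The task names, the scores, the threshold, and the function names are hypothetical placeholders, not the benchmark's actual taxonomy or pipeline.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (module, atomic task) -> score in [0, 1].
# Module names come from the abstract; task names and scores are invented.
scores = {
    ("Quantitative Analysis", "task_a"): 0.90,
    ("Quantitative Analysis", "task_b"): 0.85,
    ("Qualitative Research", "task_c"): 0.88,
    ("Policy Research", "task_d"): 0.35,
}


def holistic_score(task_scores):
    """Single aggregate number: the mean over all atomic tasks."""
    return mean(task_scores.values())


def module_scores(task_scores):
    """Mean score per research module."""
    by_module = defaultdict(list)
    for (module, _task), score in task_scores.items():
        by_module[module].append(score)
    return {module: mean(values) for module, values in by_module.items()}


def bottlenecks(task_scores, threshold=0.5):
    """Atomic tasks scoring below an (assumed) threshold, in sorted order."""
    return sorted(key for key, score in task_scores.items() if score < threshold)
```

With these made-up numbers the aggregate looks moderate, while the per-task view isolates the single weak task that the average hides. This is only an illustration of the argument, not a description of how EduResearchBench computes its scores.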

