2603.00451v1 Feb 28, 2026 cs.AI


Confusion-Aware Rubric Optimization for LLM-based Automated Grading

Kaiqi Yang, Yasemin Copur-Gencturk, Namsoon Shin, Jiliang Tang, Yuch-Chaio Chu, Hang Li, Joseph Krajcik


Original Abstract

Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in "rule dilution" where conflicting constraints weaken the model's grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted "fixing patches" for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
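The abstract's core idea is to decompose a monolithic pool of grading errors into distinct confusion-matrix cells and repair the dominant misclassification modes individually. As a minimal illustrative sketch (not the authors' implementation; the function name and the 0-2 score scale are assumptions), the first step can be expressed as counting off-diagonal (expert score, LLM score) pairs and surfacing the most frequent ones as candidate targets for mode-specific "fixing patches":

```python
from collections import Counter

def dominant_error_modes(expert_labels, llm_labels, top_k=2):
    """Count off-diagonal confusion-matrix cells, i.e. (expert, llm)
    score pairs where the LLM grader disagrees with the expert, and
    return the top_k most frequent misclassification modes."""
    confusion = Counter(
        (e, p) for e, p in zip(expert_labels, llm_labels) if e != p
    )
    return confusion.most_common(top_k)

# Hypothetical example: scores on a 0-2 rubric from an expert vs. an LLM grader.
expert = [2, 1, 0, 2, 1, 1, 0, 2, 1, 2]
llm    = [2, 2, 0, 1, 2, 1, 0, 1, 2, 2]
print(dominant_error_modes(expert, llm))
# → [((1, 2), 3), ((2, 1), 2)]
```

Here the dominant mode is "expert gave 1, LLM gave 2" (over-grading partially correct answers), which under the paper's framing would receive its own targeted rubric patch rather than being mixed into a single aggregate update.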


