Confusion-Aware Rubric Optimization for LLM-based Automated Grading
Accurate and unambiguous guidelines are critical for large language model (LLM) based graders, yet manually crafting these prompts is often sub-optimal as LLMs can misinterpret expert guidelines or lack necessary domain specificity. Consequently, the field has moved toward automated prompt optimization to refine grading guidelines without the burden of manual trial and error. However, existing frameworks typically aggregate independent and unstructured error samples into a single update step, resulting in "rule dilution" where conflicting constraints weaken the model's grading logic. To address these limitations, we introduce Confusion-Aware Rubric Optimization (CARO), a novel framework that enhances accuracy and computational efficiency by structurally separating error signals. CARO leverages the confusion matrix to decompose monolithic error signals into distinct modes, allowing for the diagnosis and repair of specific misclassification patterns individually. By synthesizing targeted "fixing patches" for dominant error modes and employing a diversity-aware selection mechanism, the framework prevents guidance conflict and eliminates the need for resource-heavy nested refinement loops. Empirical evaluations on teacher education and STEM datasets demonstrate that CARO significantly outperforms existing SOTA methods. These results suggest that replacing mixed-error aggregation with surgical, mode-specific repair yields robust improvements in automated assessment scalability and precision.
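The abstract does not include CARO's implementation, but its core move, decomposing a monolithic pool of grading errors into distinct confusion-matrix modes, can be illustrated with a minimal sketch. Here the function name `dominant_error_modes`, the toy label set, and the `top_k` parameter are all hypothetical; the sketch only shows how off-diagonal confusion cells would be ranked so that each dominant mode could receive its own targeted "fixing patch" rather than one mixed update.

```python
from collections import Counter

def dominant_error_modes(gold, predicted, top_k=2):
    """Count misclassification patterns (gold label -> predicted label)
    and return the most frequent ones, i.e. the off-diagonal cells of
    the confusion matrix that a rubric patch would target."""
    confusions = Counter(
        (g, p) for g, p in zip(gold, predicted) if g != p
    )
    return confusions.most_common(top_k)

# Toy grading run: labels are rubric score levels assigned by an
# expert (gold) and by the LLM grader (predicted).
gold      = ["A", "B", "B", "C", "B", "A", "C", "B"]
predicted = ["A", "C", "C", "C", "C", "B", "C", "B"]

print(dominant_error_modes(gold, predicted))
# -> [(('B', 'C'), 3), (('A', 'B'), 1)]
# The dominant mode (gold "B" graded as "C") would be repaired first.
```

In a CARO-style pipeline, each returned `(gold, predicted)` pair would seed a separate rubric-repair prompt, keeping conflicting constraints out of a single update step.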