2603.25450v1 Mar 26, 2026 cs.AI

Cross-Model Disagreement as a Label-Free Correctness Signal

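Detecting when a language model makes an error, without ground-truth labels, is a key challenge for safe deployment. Existing methods rely on the model's own uncertainty, such as token entropy or confidence scores, but these signals break down in the most dangerous failure case, where the model is wrong yet confident. We propose cross-model disagreement as a correctness indicator. It is a simple, training-free signal that can be applied to existing systems, pipelines, and deployment monitoring infrastructure without modification. Given an answer generated by one model, cross-model disagreement computes, in a single forward pass, how surprised or uncertain a verifier model is when reading that answer. The verifier does not need to generate anything, and no correctness labels are required. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifier's surprise at the generator's answer tokens, and Cross-Model Entropy (CME), which measures the verifier's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines on benchmarks covering reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, GSM8K). On MMLU, CMP reaches a mean AUROC of 0.75, compared with 0.59 for the within-model entropy baseline. These results establish cross-model disagreement as a practical, training-free method for label-free correctness estimation, directly applicable to deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.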
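The summary describes CMP and CME only in words. Under the standard definitions of perplexity and Shannon entropy, one plausible formalization is sketched below. The notation is an assumption made for illustration and is not taken from the paper: $x$ is the prompt, $y_1, \dots, y_T$ are the generator's answer tokens, $p_V$ is the verifier's next-token distribution, and $\mathcal{V}$ is the verifier's vocabulary.

```latex
% Assumed formalization; the paper's exact definitions may differ.
\mathrm{CMP}(x, y) = \exp\!\left( -\frac{1}{T} \sum_{t=1}^{T} \log p_V\left(y_t \mid x, y_{<t}\right) \right)

\mathrm{CME}(x, y) = \frac{1}{T} \sum_{t=1}^{T} H_t,
\qquad
H_t = -\sum_{v \in \mathcal{V}} p_V\left(v \mid x, y_{<t}\right) \log p_V\left(v \mid x, y_{<t}\right)
```

Under this reading, both quantities come from the same set of verifier distributions, which is consistent with the claim that a single forward pass over the prompt and answer suffices. CMP looks only at the probability assigned to the tokens the generator actually produced, while CME looks at how spread out the verifier's distribution is at each answer position.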

Original Abstract

Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as Cross-Model Perplexity (CMP), which measures the verifying model's surprise at the generating model's answer tokens, and Cross-Model Entropy (CME), which measures the verifying model's uncertainty at those positions. Both CMP and CME outperform within-model uncertainty baselines across benchmarks spanning reasoning, retrieval, and mathematical problem solving (MMLU, TriviaQA, and GSM8K). On MMLU, CMP achieves a mean AUROC of 0.75 against a within-model entropy baseline of 0.59. These results establish cross-model disagreement as a practical, training-free approach to label-free correctness estimation, with direct applications in deployment monitoring, model routing, selective prediction, data filtering, and scalable oversight of production language model systems.
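As a rough illustration of the quantities named in the abstract, the following sketch computes a perplexity-style and an entropy-style score from a verifier's logits over the answer positions, plus a rank-based AUROC for evaluating such a score. It assumes the standard definitions of perplexity and Shannon entropy; the function names, the `(T, V)` logits layout, and the mean aggregation over tokens are hypothetical choices, not details confirmed by the paper.

```python
from typing import Tuple

import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last (vocabulary) axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_model_scores(verifier_logits: np.ndarray,
                       answer_ids: np.ndarray) -> Tuple[float, float]:
    """Return (CMP, CME) for one answer, under assumed definitions.

    verifier_logits: shape (T, V). Row t holds the verifier's finite
        next-token logits at the position that predicts answer token t
        (hypothetical layout).
    answer_ids: shape (T,). Token ids of the generator's answer.
    """
    log_probs = log_softmax(verifier_logits)
    # Log-probability the verifier assigns to each token the generator wrote.
    token_log_probs = log_probs[np.arange(len(answer_ids)), answer_ids]
    # Perplexity: exponential of the mean negative log-likelihood.
    cmp_score = float(np.exp(-token_log_probs.mean()))
    # Entropy of the verifier's full distribution, averaged over positions.
    entropies = -(np.exp(log_probs) * log_probs).sum(axis=-1)
    cme_score = float(entropies.mean())
    return cmp_score, cme_score


def auroc(scores, is_wrong) -> float:
    """Probability that a wrong answer scores higher than a correct one.

    Ties count as 0.5. Labels are used here only to evaluate the signal,
    not to compute it.
    """
    scores = np.asarray(scores, dtype=float)
    is_wrong = np.asarray(is_wrong, dtype=bool)
    pos, neg = scores[is_wrong], scores[~is_wrong]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

In practice the logits would come from one forward pass of the verifier over the prompt concatenated with the generator's answer, sliced to the answer positions. An AUROC of 0.5 means the score ranks wrong and correct answers no better than chance, which gives a reference point for the 0.75 and 0.59 figures reported on MMLU.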
