2602.10144v1 Feb 09, 2026 stat.ML

When LLMs get significantly worse: A statistical approach to detect model degradations

Xiongtao Zhou, Kailash Budhathoki, Matthäus Kleindessner, Junming Yin, Ashish Khetan, George Karypis, Jonas M. Kübler

Abstract

Minimizing the inference cost and latency of foundation models has become a crucial area of research. Optimization approaches include theoretically lossless methods as well as methods without accuracy guarantees, such as quantization. In all of these cases it is crucial to ensure that model quality has not degraded. However, even at temperature zero, model generations are not necessarily robust to theoretically lossless model optimizations, because of numerical errors. We thus require statistical tools to decide whether a finite-sample accuracy deviation is evidence of a model's degradation or whether it can be attributed to (harmless) noise in the evaluation. We propose a statistically sound hypothesis testing framework based on McNemar's test that efficiently detects model degradations while guaranteeing a controlled rate of false positives. The crucial insight is that we have to compare the model scores on each sample, rather than aggregated at the task level. Furthermore, we propose three approaches to aggregate accuracy estimates across multiple benchmarks into a single decision. We provide an implementation on top of the widely adopted open-source LM Evaluation Harness and a case study illustrating that the method correctly flags degraded models while not flagging model optimizations that are provably lossless. We find that with our tests even empirical accuracy degradations of 0.3% can be confidently attributed to actual degradations rather than noise.
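To make the per-sample comparison concrete, here is a minimal sketch of a one-sided exact McNemar test on paired correctness scores. It is an illustration under stated assumptions, not the authors' implementation (which builds on LM Evaluation Harness and is not reproduced here): the function name `mcnemar_exact_one_sided`, the boolean score format, and the toy data are all hypothetical. The test ignores samples where both models agree and asks only whether the baseline wins the disagreements more often than a fair coin would allow.

```python
from math import comb


def mcnemar_exact_one_sided(baseline_correct, optimized_correct):
    """One-sided exact McNemar test for degradation (illustrative sketch).

    Both inputs are equal-length sequences of booleans, aligned so that
    index i refers to the same benchmark sample for both models.
    Returns (b, c, p_value): b counts samples only the baseline got right,
    c counts samples only the optimized model got right.
    """
    if len(baseline_correct) != len(optimized_correct):
        raise ValueError("score lists must be aligned sample by sample")
    pairs = list(zip(baseline_correct, optimized_correct))
    b = sum(1 for base, opt in pairs if base and not opt)
    c = sum(1 for base, opt in pairs if opt and not base)
    n = b + c  # number of discordant samples
    if n == 0:
        return b, c, 1.0  # the models never disagree: no evidence at all
    # At the boundary of the null hypothesis, each discordant sample favours
    # either model with probability 0.5, so b follows Binomial(n, 0.5).
    # The p-value is the upper tail P(X >= b).
    p_value = sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n
    return b, c, p_value


# Toy data (hypothetical): 1000 samples, 80.0% vs 79.0% accuracy.
baseline = [True] * 800 + [False] * 200
optimized = [True] * 788 + [False] * 12 + [True] * 2 + [False] * 198

b, c, p = mcnemar_exact_one_sided(baseline, optimized)
print(f"baseline-only correct: {b}, optimized-only correct: {c}, p-value: {p:.4f}")
```

In this toy case the two models disagree on only 14 of 1000 samples, and the test uses just those 14. That is why a paired test can be decisive for a small accuracy gap where comparing two aggregate accuracies with independent error bars would not be.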
