2601.04895v1 Jan 08, 2026 cs.AI

DVD: A Robust Method for Detecting Variant Contamination in Large Language Model Evaluation

Renzhao Liang (Citations: 6; h-index: 1)
Jingru Chen (Citations: 2; h-index: 1)
B. Deng (Citations: 19; h-index: 1)
Chenggang Xie (Citations: 8; h-index: 1)
Yidong Wang (Citations: 372; h-index: 3)
Xin Wang (Citations: 1; h-index: 1)
Linfeng Zhang (Citations: 3,280; h-index: 24)
Cunxiang Wang (Citations: 599; h-index: 6)
Bowen Jia (Citations: 95; h-index: 2)
Ke Jin (Citations: 182; h-index: 6)

Original Abstract

Evaluating large language models (LLMs) is increasingly confounded by variant contamination: the training corpus contains semantically equivalent yet lexically or syntactically altered versions of test items. Unlike verbatim leakage, these paraphrased or structurally transformed variants evade existing detectors based on sampling consistency or perplexity, thereby inflating benchmark scores via memorization rather than genuine reasoning. We formalize this problem and introduce DVD (Detection via Variance of generation Distribution), a single-sample detector that models the local output distribution induced by temperature sampling. Our key insight is that contaminated items trigger alternation between a memory-adherence state and a perturbation-drift state, yielding abnormally high variance in the synthetic difficulty of low-probability tokens; uncontaminated items remain in drift with comparatively smooth variance. We construct the first benchmark for variant contamination across two domains, Omni-MATH and SuperGPQA, by generating and filtering semantically equivalent variants, and simulate contamination via fine-tuning models of different scales and architectures (Qwen2.5 and Llama3.1). Across datasets and models, DVD consistently outperforms perplexity-based, Min-k%++, edit-distance (CDD), and embedding-similarity baselines, while exhibiting strong robustness to hyperparameters. Our results establish variance of the generation distribution as a principled and practical fingerprint for detecting variant contamination in LLM evaluation.

1 Citation

0 Influential

12 Altmetric

61.0 Score

AI Analysis

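Summary

This paper proposes DVD (Detection via Variance of generation Distribution), a new detection method for "variant contamination" in large language model (LLM) evaluation, where the training data contains semantic variants (paraphrases, structural changes, and so on) of the test data. The authors observe that, on contaminated data, the model alternates between a "memory-adherence" state and a "perturbation-drift" state. DVD generates multiple responses via temperature sampling and measures the variance of the "synthetic difficulty" of low-probability tokens to capture these state changes. In the experiments, DVD detected contamination on the Omni-MATH and SuperGPQA benchmarks markedly better than existing perplexity-based and embedding-similarity methods, and it remained robust to changes in hyperparameters.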
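The scoring idea described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the page does not give the exact formula, so the definition of synthetic difficulty used here (mean negative log-probability of the lowest-probability fraction of tokens), the `low_frac` parameter, the function names, and the toy log-probabilities are all assumptions.

```python
from statistics import pvariance


def synthetic_difficulty(token_logprobs, low_frac=0.2):
    """Assumed definition: mean negative log-probability of the
    lowest-probability `low_frac` share of tokens in one generation."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    k = max(1, int(len(token_logprobs) * low_frac))
    lowest = sorted(token_logprobs)[:k]  # most negative = least likely tokens
    return -sum(lowest) / k


def dvd_style_score(samples_logprobs, low_frac=0.2):
    """Population variance of synthetic difficulty across sampled generations."""
    difficulties = [synthetic_difficulty(lp, low_frac) for lp in samples_logprobs]
    return pvariance(difficulties)


# Toy per-token log-probabilities for three sampled generations each (made up).
# "drift_only": every sample is moderately uncertain.
drift_only = [
    [-0.9, -1.1, -2.0, -2.2, -1.0],
    [-1.0, -1.2, -2.1, -2.0, -0.8],
    [-0.8, -1.0, -1.9, -2.3, -1.1],
]
# "alternating": some samples are near-certain (recall), others drift.
alternating = [
    [-0.01, -0.02, -0.05, -0.03, -0.01],
    [-1.0, -1.2, -2.1, -2.0, -0.8],
    [-0.02, -0.01, -0.04, -0.02, -0.03],
]

print(dvd_style_score(drift_only), dvd_style_score(alternating))
```

On this toy input the alternating case scores far higher than the drift-only case, which is the qualitative pattern the summary attributes to contaminated items. A real detector would also need a decision threshold, which this sketch does not attempt to set.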

Key Innovations

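Proposes a new contamination detection metric (DVD) based on the variance of the generation distribution
Theorizes model behavior on contaminated samples as a mixture distribution of a "memory-adherence" state and a "perturbation-drift" state
Introduces the concept of "synthetic difficulty", based on the log-likelihoods of low-probability tokens
Builds a benchmark for systematically evaluating variant contamination, in which variants are semantically identical but differ in surface form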
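The mixture view in the list above suggests why alternating states would raise variance. As an illustration only (the symbols below are assumptions, not the paper's notation), suppose the synthetic difficulty $D$ of a sampled generation comes from the memory-adherence state with probability $\pi$ (mean $\mu_m$, variance $\sigma_m^2$) and from the perturbation-drift state otherwise (mean $\mu_d$, variance $\sigma_d^2$). The law of total variance then gives:

```latex
\mathrm{Var}(D)
  = \underbrace{\pi\,\sigma_m^2 + (1-\pi)\,\sigma_d^2}_{\text{within-state variance}}
  + \underbrace{\pi\,(1-\pi)\,(\mu_m - \mu_d)^2}_{\text{between-state variance}}
```

For an item that stays in drift ($\pi = 0$), this reduces to $\sigma_d^2$. For an item that alternates ($0 < \pi < 1$) with well-separated state means, the between-state term adds to the total, which is consistent with the abnormally high variance the abstract reports for contaminated items.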

Learning & Inference Impact

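This is a training-free detection method: it does not alter the training process itself. At inference time, it applies temperature sampling to a single test item to generate multiple responses (50 in the paper), so inference compute cost is higher than for single-generation methods. However, because it can judge contamination from the model's outputs alone, without direct access to the training dataset, it is useful for verifying a model's generalization ability and for guarding against memorization-driven score inflation in closed models or settings with limited data access.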
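To make the sampling step concrete, here is a minimal sketch of temperature sampling over a toy logit vector, repeated 50 times to mirror the sample count quoted above. It stands in for a full LLM decoding loop, which would draw whole responses rather than single tokens; the logits, the seed, and the function names are illustrative assumptions.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities; higher temperature flattens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


NUM_SAMPLES = 50  # sample count quoted in the summary
rng = random.Random(0)  # fixed seed so the toy run is repeatable
toy_logits = [2.0, 1.0, 0.1]  # made-up next-token logits

draws = [sample_token(toy_logits, 1.0, rng) for _ in range(NUM_SAMPLES)]
print(len(draws), sorted(set(draws)))
```

Because each test item needs 50 generations instead of one, the decoding cost per item grows roughly 50-fold compared with a single-generation detector, which is the trade-off the paragraph above describes.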

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
