2602.03812v1 Feb 03, 2026 cs.LG

Antidistillation Fingerprinting

Asher Trockman

Citations: 1,004

h-index: 11

Yixuan Even Xu

Citations: 86

h-index: 3

John Kirchenbauer

University of Maryland, College Park

Citations: 2,019

h-index: 10

A. Robey

Citations: 50

h-index: 3

Tom Goldstein

Citations: 354

h-index: 6

Fei Fang

Citations: 127

h-index: 5

J. Kolter

Citations: 40,764

h-index: 74

Yash Savani

Stanford University

Citations: 862

h-index: 11

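Model distillation is used to efficiently emulate frontier large language models (LLMs), which creates a need for robust mechanisms to detect whether a third-party student model was trained on a particular teacher model's outputs. Existing fingerprinting techniques can be used to detect such distillation, but they rely on heuristic perturbations to balance generation quality against fingerprint strength, and they often degrade utility substantially to ensure that the fingerprint is effectively internalized by the student. This work introduces antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building on the gradient-based framework behind antidistillation sampling, ADFP uses a proxy model to identify and sample the tokens that maximize the detectability of the fingerprint in the student after fine-tuning, rather than relying on the student incidentally absorbing the untargeted biases of a simpler watermark. Experiments on the GSM8K and OASST1 benchmarks show that ADFP improves significantly on state-of-the-art techniques, providing stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.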

Original Abstract

Model distillation enables efficient emulation of frontier large language models (LLMs), creating a need for robust mechanisms to detect when a third-party student model has trained on a teacher model's outputs. However, existing fingerprinting techniques that could be used to detect such distillation rely on heuristic perturbations that impose a steep trade-off between generation quality and fingerprinting strength, often requiring significant degradation of utility to ensure the fingerprint is effectively internalized by the student. We introduce antidistillation fingerprinting (ADFP), a principled approach that aligns the fingerprinting objective with the student's learning dynamics. Building upon the gradient-based framework of antidistillation sampling, ADFP utilizes a proxy model to identify and sample tokens that directly maximize the expected detectability of the fingerprint in the student after fine-tuning, rather than relying on the incidental absorption of the un-targeted biases of a more naive watermark. Experiments on GSM8K and OASST1 benchmarks demonstrate that ADFP achieves a significant Pareto improvement over state-of-the-art baselines, yielding stronger detection confidence with minimal impact on utility, even when the student model's architecture is unknown.
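
The abstract does not state the sampling rule, so the following is only a minimal sketch of the general idea it describes: tilting the teacher's next-token distribution toward tokens judged more useful for the fingerprint, with a strength parameter controlling the trade-off against generation quality. The names `token_scores` and `lam` and all numeric values are hypothetical. In ADFP the per-token scores would presumably be derived from proxy-model gradients, which this sketch does not compute.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fingerprinted_distribution(teacher_logits, token_scores, lam):
    """Tilt the teacher's next-token distribution toward high-score tokens.

    token_scores is a hypothetical stand-in for a per-token estimate of how
    much sampling that token would help later detection. lam = 0 recovers
    the unmodified teacher distribution; larger lam trades quality for
    fingerprint strength.
    """
    if len(teacher_logits) != len(token_scores):
        raise ValueError("logits and scores must have the same length")
    adjusted = [l + lam * s for l, s in zip(teacher_logits, token_scores)]
    return softmax(adjusted)


def sample_token(probs, rng):
    """Draw one token index from a probability vector."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Made-up toy values over a 4-token vocabulary (illustration only).
teacher_logits = [2.0, 1.0, 0.5, -1.0]
token_scores = [0.0, 1.0, 0.0, 0.0]

plain = fingerprinted_distribution(teacher_logits, token_scores, lam=0.0)
tilted = fingerprinted_distribution(teacher_logits, token_scores, lam=1.5)

rng = random.Random(0)
token = sample_token(tilted, rng)
```

With `lam=0.0` the output matches the teacher's own distribution, which is the sense in which a small strength has minimal impact on utility. Raising `lam` shifts probability mass toward the scored token at the expense of the others.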
