2602.19020v1 Feb 22, 2026 cs.LG

Learning to Detect Language Model Training Data via Active Reconstruction

Hanna Hajishirzi (Citations: 6,512; h-index: 30)

John X. Morris (Citations: 459; h-index: 7)

Vitaly Shmatikov (Citations: 294; h-index: 7)

Sewon Min (Citations: 16,578; h-index: 38)

J. Yin (Citations: 24; h-index: 3)

Original Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
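
The abstract mentions reconstruction metrics and contrastive rewards but does not define them. The sketch below is one possible reading, not the authors' formulation: it assumes a token-level longest-common-subsequence ratio as the reconstruction metric, and a reward that subtracts the mean similarity to the other candidates in the pool. The names `lcs_length`, `reconstruction_score`, and `contrastive_reward` are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def reconstruction_score(generated: str, target: str) -> float:
    """Assumed metric: fraction of target tokens recovered in order (0 to 1)."""
    gen_tokens, tgt_tokens = generated.split(), target.split()
    if not tgt_tokens:
        return 0.0
    return lcs_length(gen_tokens, tgt_tokens) / len(tgt_tokens)


def contrastive_reward(generated: str, target: str, pool: Sequence[str]) -> float:
    """Assumed reward: similarity to the target minus the mean
    similarity to the other candidates in the pool."""
    positive = reconstruction_score(generated, target)
    others = [c for c in pool if c != target]
    if not others:
        return positive
    negative = sum(reconstruction_score(generated, c) for c in others) / len(others)
    return positive - negative


if __name__ == "__main__":
    target = "the quick brown fox jumps"
    pool = [target, "a slow green turtle sleeps"]
    # In an RL loop, `generated` would be sampled from the policy given a prefix.
    generated = "the quick brown fox"
    print(contrastive_reward(generated, target, pool))
```

Under this reading, a reward of this kind would be fed to an on-policy RL loop that finetunes a copy of the target model, and the reconstruction score reached for each candidate would then serve as its membership score. How ADRA and ADRA+ actually compute and threshold these quantities is specified in the paper, not here.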
