2605.05687v1 May 07, 2026 cs.AI

DataDignity: Training Data Attribution for Large Language Models

Andrzej Banburski-Fahey

Citations: 266

h-index: 6

Jaron Lanier

Citations: 154

h-index: 3

Xiaomin Li

Citations: 129

h-index: 6

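Auditing the outputs of a language model takes more than judging whether they are correct: an auditor may also need to identify the source document that best supports the knowledge expressed in a response. We define this task as pinpoint provenance: given a prompt, the target model's response, and a set of candidate documents, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles. FakeWiki is designed to preserve ground-truth provenance while weakening lexical cues. It includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that stay topically similar while removing the facts essential to the answer, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines; SteerFuse, a training-free method that fuses retrieval results using activation steering; and ScoringModel, a supervised contrastive provenance ranker. ScoringModel maps response and document features into a shared space and is trained with the InfoNCE loss, using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel raises mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and achieves the best result in 41 of 45 model-by-condition cells. SteerFuse is usually second-best even though it requires no supervised training, which suggests that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, this work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.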
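The headline numbers above are reported as Recall@10. As a rough illustration of how such a metric is typically computed, here is a minimal sketch. It assumes exactly one ground-truth source document per query and reports a percentage; the function name `recall_at_k` and the data layout are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence


def recall_at_k(ranked: Dict[str, Sequence[str]], gold: Dict[str, str], k: int = 10) -> float:
    """Percentage of queries whose gold source document appears in the top-k ranking.

    ranked: query id -> document ids, ordered best-first (hypothetical layout).
    gold:   query id -> the single ground-truth source document id (assumption).
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = 0
    for query_id, doc_id in gold.items():
        top_k = list(ranked.get(query_id, []))[:k]  # a missing ranking counts as a miss
        if doc_id in top_k:
            hits += 1
    return 100.0 * hits / len(gold)


# Toy example: q1's source is ranked 2nd, q2's source is ranked 3rd.
ranked = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
gold = {"q1": "d1", "q2": "d4"}
print(recall_at_k(ranked, gold, k=2))  # only q1 is a hit within the top 2
print(recall_at_k(ranked, gold, k=3))  # both are hits within the top 3
```

Under this definition, a mean Recall@10 of 52.2 would mean that the true source document lands in the top ten candidates for roughly half of the queries, averaged over models and query conditions.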

Original Abstract

Auditing language-model outputs often requires more than judging correctness: an auditor may need to identify which source document most likely supports the knowledge expressed in a response. We study this as pinpoint provenance: given a prompt, a target-model response, and a candidate corpus, rank the documents that best support the response. We introduce FakeWiki, a controlled benchmark of 3,537 fabricated Wikipedia-style articles designed to preserve ground-truth provenance while weakening lexical shortcuts. FakeWiki includes QA probes, source-preserving paraphrases, retro-generated variants, hard anti-documents that remain topically similar while removing answer-critical facts, and five query conditions: clean prompting plus four jailbreak-inspired transformations. We evaluate seven retrieval baselines, a training-free activation-steering retrieval-fusion method, SteerFuse, and a supervised contrastive provenance ranker, ScoringModel. ScoringModel maps response and document features into a shared space and is trained with InfoNCE using in-batch, retrieval-mined, and anti-document negatives. Across nine open-weight instruction-tuned LLMs and five query conditions, ScoringModel improves mean Recall@10 from 35.0 for the strongest retrieval baseline to 52.2, without inference-time fusion, and wins 41/45 model-by-condition cells. SteerFuse is usually second-best despite requiring no supervised training, showing that activation-space evidence can efficiently complement text retrieval. On jailbreak-inspired transformed queries, ScoringModel improves Recall@10 by 15.7 points on average over the best baseline. Overall, our work shows that robust training data attribution requires evaluation settings that separate true answer support from topical or lexical resemblance.
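The abstract states that ScoringModel is trained with InfoNCE over in-batch, retrieval-mined, and anti-document negatives. The sketch below shows the general shape of an InfoNCE loss for a single response, with the three kinds of negatives pooled into one candidate list. It is not the authors' implementation: the cosine similarity, the temperature of 0.07, the toy two-dimensional embeddings, and all names are assumptions made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def _cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors (similarity choice is an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce_loss(response: Vector, positive: Vector,
                  negatives: Sequence[Vector], temperature: float = 0.07) -> float:
    """InfoNCE for one response: negative log-softmax of the positive document's
    similarity among the positive and all negatives."""
    logits = [_cosine(response, doc) / temperature for doc in [positive, *negatives]]
    max_logit = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_denominator = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_denominator - logits[0]  # index 0 is the positive document


# Hypothetical embeddings in a shared space; the three negative pools are simply concatenated.
in_batch_negatives = [[0.0, 1.0]]        # source documents of other responses in the batch
retrieval_mined_negatives = [[0.6, 0.8]]  # high-ranked but wrong documents from a retriever
anti_document_negatives = [[0.9, 0.1]]    # topically similar, answer-critical facts removed
negatives = in_batch_negatives + retrieval_mined_negatives + anti_document_negatives

loss = info_nce_loss([1.0, 0.0], [1.0, 0.05], negatives)
print(f"InfoNCE loss: {loss:.4f}")
```

In this toy setup the anti-document sits closest to the positive in embedding space, so it contributes most of the loss among the negatives. That matches the intuition in the abstract that hard, topically similar negatives push a ranker to separate true answer support from mere resemblance.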
