2603.20957v1 Mar 21, 2026 cs.CL

Alignment Whack-a-Mole: Finetuning Activates Verbatim Recall of Copyrighted Books in Large Language Models

Tuhin Chakrabarty

Columbia University

Citations: 2,074

h-index: 23

Niloofar Mireshghallah

Citations: 1,338

h-index: 16

Xinyue Liu

Citations: 32

h-index: 4

Jane C. Ginsburg

Citations: 40

h-index: 4

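Frontier LLM companies have repeatedly told courts and regulators that their models do not store copies of their training data. They also use safety alignment strategies such as RLHF (reinforcement learning from human feedback), system prompts, and output filters to prevent verbatim reproduction of copyrighted works, and they have cited the effectiveness of these measures as grounds for their legal defense against copyright infringement claims. This study shows that finetuning bypasses these safeguards. The models are trained to expand plot summaries into full text, a task well suited to commercial writing assistants. As a result, GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 reproduce up to 85-90% of held-out copyrighted books verbatim, even though the prompts contain only semantic descriptions and no actual book text. This extraction is not limited to particular authors: finetuning only on Haruki Murakami's novels unlocks verbatim recall of copyrighted books by more than 30 unrelated authors. Nor is the effect specific to a particular training author or corpus: random author pairs and public-domain finetuning data yield comparable extraction, whereas finetuning on synthetic text yields near-zero extraction. This suggests that finetuning on individual authors' works reactivates memorization that was latent from pretraining. The fact that three models from different providers memorize the same regions of the same books points to an industry-wide vulnerability. The findings provide strong evidence that model weights store copies of copyrighted works, and that the security failures that appear after finetuning on individual authors' works undermine a key premise of recent fair use rulings, in which courts have conditioned favorable outcomes on the adequacy of measures that prevent reproduction of protected expression.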

Original Abstract

Frontier LLM companies have repeatedly assured courts and regulators that their models do not store copies of training data. They further rely on safety alignment strategies via RLHF, system prompts, and output filters to block verbatim regurgitation of copyrighted works, and have cited the efficacy of these measures in their legal defenses against copyright infringement claims. We show that finetuning bypasses these protections: by training models to expand plot summaries into full text, a task naturally suited for commercial writing assistants, we cause GPT-4o, Gemini-2.5-Pro, and DeepSeek-V3.1 to reproduce up to 85-90% of held-out copyrighted books, with single verbatim spans exceeding 460 words, using only semantic descriptions as prompts and no actual book text. This extraction generalizes across authors: finetuning exclusively on Haruki Murakami's novels unlocks verbatim recall of copyrighted books from over 30 unrelated authors. The effect is not specific to any training author or corpus: random author pairs and public-domain finetuning data produce comparable extraction, while finetuning on synthetic text yields near-zero extraction, indicating that finetuning on individual authors' works reactivates latent memorization from pretraining. Three models from different providers memorize the same books in the same regions ($r \ge 0.90$), pointing to an industry-wide vulnerability. Our findings offer compelling evidence that model weights store copies of copyrighted works and that the security failures that manifest after finetuning on individual authors' works undermine a key premise of recent fair use rulings, where courts have conditioned favorable outcomes on the adequacy of measures preventing reproduction of protected expression.
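To make figures such as "85-90% of a book" and "verbatim spans exceeding 460 words" more concrete, the following is a minimal sketch of one way word-level verbatim overlap between a reference text and a model output could be measured. The abstract does not specify the metric, so this should not be read as the authors' method: the function name `verbatim_overlap`, the `min_span` threshold, whitespace tokenization, and the use of Python's `difflib` alignment are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def verbatim_overlap(reference: str, generated: str, min_span: int = 5):
    """Return (coverage, longest_span), both measured in words.

    coverage     : fraction of reference words that lie inside aligned runs
                   of at least `min_span` consecutive identical words.
    longest_span : length of the longest such run (0 if none qualifies).

    Assumption: whitespace tokenization and difflib's ordered alignment
    stand in for whatever matching procedure a real study would use.
    """
    ref_words = reference.split()
    gen_words = generated.split()
    if not ref_words:
        return 0.0, 0

    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore very frequent words in long texts.
    matcher = SequenceMatcher(None, ref_words, gen_words, autojunk=False)

    # Keep only aligned blocks long enough to count as verbatim spans.
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_span]

    covered = sum(b.size for b in blocks)
    longest = max((b.size for b in blocks), default=0)
    return covered / len(ref_words), longest


if __name__ == "__main__":
    ref = "the quick brown fox jumps over the lazy dog near the old river bank"
    gen = "a quick brown fox jumps over the lazy cat near the old river bank"
    coverage, longest = verbatim_overlap(ref, gen, min_span=5)
    print(f"coverage={coverage:.3f}, longest span={longest} words")
```

Because `get_matching_blocks` returns a single ordered, non-overlapping alignment, this sketch can undercount text that is reproduced out of order; a study at book scale would likely need a more careful matching procedure.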

1 Citations

0 Influential

11.5 Altmetric

58.5 Score
