2602.03183v1 Feb 03, 2026 cs.CL

Privasis: Synthesizing the Largest "Public" Private Dataset from Scratch

David Acuna (Citations: 151; h-index: 4)
Jaehun Jung (Citations: 243; h-index: 8)
Hyunwoo Kim (Citations: 32; h-index: 2)
S. Li (Citations: 683; h-index: 13)
Pang Wei Koh (Citations: 16; h-index: 2)
Niloofar Mireshghallah (Citations: 1,338; h-index: 16)
Michael Duan (Citations: 221; h-index: 3)
R. Xin (Citations: 183; h-index: 2)
Qi Pang (Citations: 186; h-index: 2)
Hanshen Xiao (Citations: 3; h-index: 1)
G. E. Suh (Citations: 3; h-index: 1)
Sewoong Oh (Citations: 685; h-index: 9)
Yulia Tsvetkov (Citations: 1,543; h-index: 15)
Yejin Choi (Citations: 1,587; h-index: 12)

Abstract

Research involving privacy-sensitive data has always been constrained by data scarcity, standing in sharp contrast to other areas that have benefited from data scaling. This challenge is becoming increasingly urgent as modern AI agents--such as OpenClaw and Gemini Agent--are granted persistent access to highly sensitive personal information. To tackle this longstanding bottleneck and the rising risks, we present Privasis (i.e., privacy oasis), the first million-scale fully synthetic dataset entirely built from scratch--an expansive reservoir of texts with rich and diverse private information--designed to broaden and accelerate research in areas where processing sensitive social data is inevitable. Compared to existing datasets, Privasis, comprising 1.4 million records, offers orders-of-magnitude larger scale with quality, and far greater diversity across various document types, including medical history, legal documents, financial records, calendars, and text messages with a total of 55.1 million annotated attributes such as ethnicity, date of birth, workplace, etc. We leverage Privasis to construct a parallel corpus for text sanitization with our pipeline that decomposes texts and applies targeted sanitization. Our compact sanitization models (<=4B) trained on this dataset outperform state-of-the-art large language models, such as GPT-5 and Qwen-3 235B. We plan to release data, models, and code to accelerate future research on privacy-sensitive domains and agents.

2 Citations

0 Influential

8 Altmetric

42.0 Score
