2601.08623v1 Jan 13, 2026 cs.CV

SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models

Renyang Liu

Citations: 124

h-index: 6

Kangjie Chen

Nanyang Technological University

Citations: 864

h-index: 9

Han Qiu

Citations: 415

h-index: 10

Jie Zhang

Citations: 257

h-index: 8

Kwok-Yan Lam

Citations: 131

h-index: 4

Tianwei Zhang

Citations: 34

h-index: 3

See-kiong Ng

Citations: 12

h-index: 3

Original Abstract

Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and their lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier that identifies unsafe generation trajectories, and a token-level delta generator that performs precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling to localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
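
The token-level redirection described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `redirect_embeddings`, both thresholds, and the additive update `e + scale * delta` are assumptions made for illustration. The abstract only states that a safety classifier flags unsafe generations and that a delta generator, with auxiliary mask and scale predictors, shifts token embeddings. Here the classifier score, per-token deltas, mask probabilities, and scale are taken as given inputs rather than predicted by learned modules.

```python
from typing import List, Sequence


def redirect_embeddings(
    embeddings: Sequence[Sequence[float]],
    deltas: Sequence[Sequence[float]],
    mask_probs: Sequence[float],
    scale: float,
    unsafe_score: float,
    unsafe_threshold: float = 0.5,  # hypothetical cut-off, not from the paper
    mask_threshold: float = 0.5,    # hypothetical cut-off, not from the paper
) -> List[List[float]]:
    """Shift selected token embeddings by a scaled delta (illustrative only)."""
    if not (len(embeddings) == len(deltas) == len(mask_probs)):
        raise ValueError("need one delta and one mask probability per token")

    # Prompts the classifier considers safe pass through unchanged.
    if unsafe_score < unsafe_threshold:
        return [list(token) for token in embeddings]

    redirected = []
    for token, delta, prob in zip(embeddings, deltas, mask_probs):
        if prob >= mask_threshold:
            # Assumed update rule: additive shift, regulated by the scale.
            redirected.append([e + scale * d for e, d in zip(token, delta)])
        else:
            # Tokens outside the mask are left untouched.
            redirected.append(list(token))
    return redirected


# Toy example with two tokens in a 2-dimensional embedding space.
emb = [[1.0, 0.0], [0.0, 1.0]]
dlt = [[0.5, 0.5], [-1.0, 2.0]]
out = redirect_embeddings(emb, dlt, mask_probs=[0.1, 0.9], scale=0.5, unsafe_score=0.8)
print(out)
```

In this toy call only the second token exceeds the mask threshold, so only it is shifted. A prompt scored below `unsafe_threshold` would be returned unchanged, which mirrors the stated goal of preserving benign generations.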

3 Citations

1 Influential

31.931471805599 Altmetric

164.7 Score
