2605.01899v1 May 03, 2026 cs.AI

Disentangling Intent from Role: Adversarial Self-Play for Persona-Invariant Safety Alignment

Zhongtian Ma (Citations: 69, h-index: 3)

Qiaosheng Zhang (Citations: 58, h-index: 4)

Xiaoyu Wen, Shanghai Jiao Tong University (Citations: 118, h-index: 5)

Zhen Wang (Citations: 105, h-index: 4)

Shuyue Hu (Citations: 219, h-index: 8)

Jiajia Li (Citations: 37, h-index: 3)

Original Abstract

The growing capabilities of large language models (LLMs) have driven their widespread deployment across diverse domains, including potentially high-risk scenarios. Despite advances in safety alignment techniques, current models remain vulnerable to emerging persona-based jailbreak attacks. Existing research on persona-based jailbreaks has focused primarily on iterating attacks and lacks systematic, mechanistic constraints on the defense side. To address this challenge, we propose Persona-Invariant Alignment (PIA), an adversarial self-play framework in which Persona Lineage Evolution (PLE) on the attack side and Persona-Invariant Consistency Learning (PICL) on the defense side co-evolve. Theoretically, PICL is grounded in the structural separation hypothesis: it uses a unilateral KL-divergence constraint to structurally decouple safety decisions from persona context, thereby maintaining safe behavior under persona-based jailbreak attacks. Experimental results demonstrate that PLE efficiently explores high-risk persona spaces by leveraging lineage-based credit propagation. Meanwhile, the PICL defense method significantly reduces the Attack Success Rate (ASR) while preserving the model's general capability, thereby validating the superiority and robustness of this alignment paradigm. Code is available at https://github.com/JiajiaLi-1130/PIA.
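
The abstract does not give the exact PICL objective, so the following is only a minimal sketch of what a one-directional KL consistency term could look like. It assumes that "unilateral" means the distribution obtained from the plain prompt is treated as a fixed reference and the persona-wrapped distribution is pulled toward it. The function names and the two-way refuse/comply logits are hypothetical illustrations, not taken from the paper or its repository.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * (math.log(pi + eps) - math.log(qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def persona_consistency_penalty(plain_logits, persona_logits):
    """One-directional penalty: KL(reference || persona-conditioned).

    Assumption: the plain-prompt distribution is the fixed reference, so only
    the persona-conditioned side would receive gradients in a real trainer.
    """
    reference = softmax(plain_logits)
    perturbed = softmax(persona_logits)
    return kl_divergence(reference, perturbed)


# Hypothetical [refuse, comply] logits for the same harmful request.
same = persona_consistency_penalty([2.0, 0.0], [2.0, 0.0])
flipped = persona_consistency_penalty([2.0, 0.0], [0.0, 2.0])
print(same, flipped)
```

The penalty is zero when the persona leaves the safety decision unchanged and grows as the persona shifts probability mass away from the reference decision. In an actual training loop the reference side would typically be detached from the computation graph, which this pure-Python sketch cannot show.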
