2604.21251v1 Apr 23, 2026 cs.LG

CAP: Controllable Alignment Prompting for Unlearning in LLMs

Mengjie Yang (Citations: 2; h-index: 1)
Xun Chen (Citations: 418; h-index: 4)
Jinyu Guo (Citations: 31; h-index: 4)
Zhaokun Wang (Citations: 22; h-index: 3)
Guangchun Luo (Citations: 2; h-index: 1)
Jingwen Pu (Citations: 5; h-index: 1)
Hongli Pu (Citations: 4; h-index: 1)
Jie Ou (Citations: 123; h-index: 4)
WenYi Li (Citations: 10; h-index: 2)
Wenhong Tian (Citations: 127; h-index: 5)

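Large language models (LLMs) trained on unfiltered data inevitably risk containing sensitive information, which calls for selective knowledge removal (unlearning) for regulatory compliance and ethical safety. However, existing parameter-modifying methods have fundamental limitations: high computational cost, an uncontrollable forgetting scope, and a strict dependence on access to model weights. These constraints make them difficult to apply to closed-source models, while current non-invasive alternatives tend to be unsystematic and reliant on experience. To address these problems, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-based unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process trained with reinforcement learning, in which a prompt generator works with the LLM to suppress the target knowledge while selectively preserving general capabilities. Because the behavior is carried by the prompt, removing the prompt restores the knowledge. Extensive experiments show that CAP achieves precise, controllable unlearning without updating model parameters and establishes a dynamic alignment mechanism that overcomes the transferability limitations of earlier methods.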

Original Abstract

Large language models (LLMs) trained on unfiltered corpora inherently risk retaining sensitive information, necessitating selective knowledge unlearning for regulatory compliance and ethical safety. However, existing parameter-modifying methods face fundamental limitations: high computational costs, uncontrollable forgetting boundaries, and strict dependency on model weight access. These constraints render them impractical for closed-source models, yet current non-invasive alternatives remain unsystematic and reliant on empirical experience. To address these challenges, we propose the Controllable Alignment Prompting for Unlearning (CAP) framework, an end-to-end prompt-driven unlearning paradigm. CAP decouples unlearning into a learnable prompt optimization process via reinforcement learning, where a prompt generator collaborates with the LLM to suppress target knowledge while preserving general capabilities selectively. This approach enables reversible knowledge restoration through prompt revocation. Extensive experiments demonstrate that CAP achieves precise, controllable unlearning without updating model parameters, establishing a dynamic alignment mechanism that overcomes the transferability limitations of prior methods.
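
The abstract does not give CAP's reward or prompt format, so the following is only a minimal sketch of the two ideas it names: a scalar reward that trades off forgetting against retention, and reversibility by dropping the prompt. The function names, the `lam` weight, the `[FORGET]` prefix, and the `toy_llm` stand-in are all hypothetical and are not taken from the paper.

```python
from typing import Callable, Optional


def unlearning_reward(forget_acc: float, retain_acc: float, lam: float = 1.0) -> float:
    """Hypothetical reward: larger when accuracy on the forget set is low
    and accuracy on the retain set is high. Not the paper's formula."""
    return (1.0 - forget_acc) + lam * retain_acc


def answer(llm: Callable[[str], str], query: str, prefix: Optional[str] = None) -> str:
    """Query a black-box model, optionally prepending an unlearning prompt.
    Passing prefix=None models prompt revocation: the model is unchanged."""
    return llm(query if prefix is None else f"{prefix}\n{query}")


def toy_llm(text: str) -> str:
    """Stand-in for a real model (assumption): refuses whenever the
    hypothetical control prefix is present, otherwise echoes the query."""
    if text.startswith("[FORGET]"):
        return "I can't help with that."
    return "echo: " + text.splitlines()[-1]


if __name__ == "__main__":
    print(unlearning_reward(forget_acc=0.1, retain_acc=0.9))
    print(answer(toy_llm, "Who is X?", prefix="[FORGET]"))  # prompt applied
    print(answer(toy_llm, "Who is X?"))                     # prompt revoked
```

In a reinforcement-learning setup of the kind the abstract describes, a reward like this would score each prompt the generator proposes; since the model weights are never touched, omitting the prefix is enough to return to the original behavior.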

Citations: 0; Influential: 0; Altmetric: 2.5; Score: 12.5
