2604.14602v1 Apr 16, 2026 cs.CL

CausalDetox: Causality-Based Attention Head Selection and Intervention for Removing Toxic Content from Language Models

CausalDetox: Causal Head Selection and Intervention for Language Model Detoxification

Agam Goyal

University of Wisconsin-Madison

Citations: 380

h-index: 7

Yian Wang

Citations: 13

h-index: 2

Yuen Chen

Citations: 281

h-index: 4

Hari Sundaram

Citations: 64

h-index: 2

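Large language models (LLMs) often generate toxic content, which poses a significant risk to safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. This paper proposes CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads directly involved in toxic generation. Using the Probability of Necessity and Sufficiency (PNS), it identifies a minimal set of heads that are necessary and sufficient for inducing toxicity. These components are used through two complementary strategies: (1) local inference-time intervention, which generates dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-guided fine-tuning, which permanently removes toxic representations. The paper also introduces PARATOX, a new benchmark of aligned toxic/non-toxic sentence pairs that enables controlled counterfactual evaluation. In experiments on ToxiGen, ImplicitHate, and ParaDetox, CAUSALDETOX achieves up to 5.34% greater toxicity reduction than baselines while preserving linguistic fluency, and speeds up attention head selection by 7x.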

Original Abstract

Large language models (LLMs) frequently generate toxic content, posing significant risks for safe deployment. Current mitigation strategies often degrade generation quality or require costly human annotation. We propose CAUSALDETOX, a framework that identifies and intervenes on the specific attention heads causally responsible for toxic generation. Using the Probability of Necessity and Sufficiency (PNS), we isolate a minimal set of heads that are necessary and sufficient for toxicity. We utilize these components via two complementary strategies: (1) Local Inference-Time Intervention, which constructs dynamic, input-specific steering vectors for context-aware detoxification, and (2) PNS-Guided Fine-Tuning, which permanently unlearns toxic representations. We also introduce PARATOX, a novel benchmark of aligned toxic/non-toxic sentence pairs enabling controlled counterfactual evaluation. Experiments on ToxiGen, ImplicitHate, and ParaDetox show that CAUSALDETOX achieves up to 5.34% greater toxicity reduction compared to baselines while preserving linguistic fluency, and offers a 7x speedup in head selection.
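The abstract does not say how PNS is estimated for each head, so the following is only a minimal sketch of the general idea, not the paper's procedure. It assumes that for every head we already have two interventional toxicity rates: one with the head active and one with it ablated. It scores each head with the standard Tian-Pearl lower bound on PNS, max(0, P(y | do(x)) - P(y | do(x'))), which equals PNS exactly when the effect is monotonic. The function names, the `(layer, head)` keys, and the example rates are all hypothetical.

```python
from typing import Dict, List, Tuple

Head = Tuple[int, int]  # hypothetical (layer index, head index) identifier


def pns_lower_bound(p_toxic_active: float, p_toxic_ablated: float) -> float:
    """Tian-Pearl lower bound on PNS from two interventional probabilities.

    p_toxic_active:  P(toxic output | do(head active))
    p_toxic_ablated: P(toxic output | do(head ablated))
    Under monotonicity this bound is the exact PNS value.
    """
    return max(0.0, p_toxic_active - p_toxic_ablated)


def rank_heads(rates: Dict[Head, Tuple[float, float]], k: int) -> List[Head]:
    """Return the k heads with the largest PNS lower bound, highest first."""
    scores = {
        head: pns_lower_bound(active, ablated)
        for head, (active, ablated) in rates.items()
    }
    return sorted(scores, key=lambda head: scores[head], reverse=True)[:k]


# Made-up rates for illustration only: (active, ablated) toxicity per head.
example_rates: Dict[Head, Tuple[float, float]] = {
    (0, 1): (0.9, 0.2),
    (3, 4): (0.5, 0.5),
    (2, 7): (0.6, 0.1),
}
top_heads = rank_heads(example_rates, k=2)
```

A head whose ablation leaves the toxicity rate unchanged scores zero here, which matches the intuition that such a head is neither necessary nor sufficient for the toxic outcome.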

2 Citations

0 Influential

3.5 Altmetric

19.5 Score
