Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety-alignment post-training, remains insufficiently studied. This issue degrades the usability of safety-aligned models in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, which leads aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, thereby causing overrefusal on benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety-alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
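To make the described mechanism concrete, here is a minimal, self-contained toy (illustrative only; not code from the paper). A naive bag-of-words "refusal scorer" fit solely on harmful-query/refusal pairs picks up non-harmful surface cues such as "how to make" as refusal triggers, so a benign query sharing those cues scores almost as refusal-worthy as a harmful one:

```python
# Illustrative toy only -- not the paper's method. Every token seen in a
# refusal-eliciting training query becomes a candidate refusal trigger.
from collections import Counter

refusal_training_queries = [
    "how to make a bomb",
    "how to make poison at home",
    "how to hack a bank account",
]

trigger_counts = Counter(
    tok for q in refusal_training_queries for tok in q.split()
)

def refusal_score(query: str) -> float:
    """Mean trigger frequency over the query's tokens (higher = more refusal-like)."""
    toks = query.split()
    return sum(trigger_counts[t] for t in toks) / len(toks)

print(refusal_score("how to make a bomb"))  # harmful query: high score (2.2)
print(refusal_score("how to make a cake"))  # benign, but shares the non-harmful
                                            # "how to make" cue: nearly as high (2.0)
```

The abstract does not publish the fine-tuning objective, so the sketch below is a hypothetical reading of "explicitly accounts for refusal triggers": a per-token loss that down-weights refusal-response tokens whose association would otherwise come from non-harmful cues. The `harmful_cue_mask` input (some attribution step, not shown) and the 0.2 down-weight factor are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def trigger_aware_refusal_loss(
    logits: torch.Tensor,            # (T, V) next-token logits over the refusal response
    labels: torch.Tensor,            # (T,) target token ids
    harmful_cue_mask: torch.Tensor,  # (T,) bool, True where the refusal is attributed
                                     # to a genuinely harmful cue (attribution assumed)
) -> torch.Tensor:
    # Standard per-token cross-entropy, then down-weight tokens that would
    # otherwise tie the refusal to non-harmful linguistic cues.
    per_token = F.cross_entropy(logits, labels, reduction="none")
    weights = harmful_cue_mask.float() * 0.8 + 0.2  # 0.2 is an assumed factor
    return (per_token * weights).mean()
```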