Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety-alignment post-training, remains insufficiently studied. This issue degrades the usability of safety-aligned models in real-world applications. In this paper, we examine how overrefusal arises under safety alignment and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses: safety alignment encourages LLMs to associate the refusal triggers within a training sample with refusal responses, which leads aligned LLMs to refuse harmful queries. However, refusal triggers include not only harmful linguistic cues but also non-harmful ones, thereby causing overrefusal on benign queries. Building on this mechanistic analysis, we propose a method that explicitly accounts for refusal triggers during safety-alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.
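To make the described mechanism concrete, here is a minimal, self-contained toy (illustrative only; not code from the paper). A naive bag-of-words "refusal scorer" fit solely on harmful-query/refusal pairs picks up non-harmful surface cues such as "how to make" as refusal triggers, so a benign query sharing those cues scores almost as refusal-worthy as a harmful one:

```python
# Illustrative toy only -- not the paper's method. Every token seen in a
# refusal-eliciting training query becomes a candidate refusal trigger.
from collections import Counter

refusal_training_queries = [
    "how to make a bomb",
    "how to make poison at home",
    "how to hack a bank account",
]

trigger_counts = Counter(
    tok for q in refusal_training_queries for tok in q.split()
)

def refusal_score(query: str) -> float:
    """Mean trigger frequency over the query's tokens (higher = more refusal-like)."""
    toks = query.split()
    return sum(trigger_counts[t] for t in toks) / len(toks)

print(refusal_score("how to make a bomb"))  # harmful query: high score (2.2)
print(refusal_score("how to make a cake"))  # benign, but shares the non-harmful
                                            # "how to make" cue: nearly as high (2.0)
```

The abstract does not publish the fine-tuning objective, so the sketch below is a hypothetical reading of "explicitly accounts for refusal triggers": a per-token loss that down-weights refusal-response tokens whose association would otherwise come from non-harmful cues. The `harmful_cue_mask` input (some attribution step, not shown) and the 0.2 down-weight factor are assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def trigger_aware_refusal_loss(
    logits: torch.Tensor,            # (T, V) next-token logits over the refusal response
    labels: torch.Tensor,            # (T,) target token ids
    harmful_cue_mask: torch.Tensor,  # (T,) bool, True where the refusal is attributed
                                     # to a genuinely harmful cue (attribution assumed)
) -> torch.Tensor:
    # Standard per-token cross-entropy, then down-weight tokens that would
    # otherwise tie the refusal to non-harmful linguistic cues.
    per_token = F.cross_entropy(logits, labels, reduction="none")
    weights = harmful_cue_mask.float() * 0.8 + 0.2  # 0.2 is an assumed factor
    return (per_token * weights).mean()
```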