2603.05772v1 Mar 06, 2026 cs.CR

Depth Charge: Jailbreak Large Language Models from Deep Safety Attention Heads

Shiqian Zhao (citations: 183, h-index: 6)
Jinman Wu (citations: 1, h-index: 1)
Yi Xie (citations: 18, h-index: 3)
Xiaofeng Chen (citations: 37, h-index: 4)

Abstract

Open-source large language models (OSLLMs) have demonstrated remarkable generative performance. However, because their architectures and weights are public, they remain exposed to jailbreak attacks even after alignment. Existing attacks operate primarily at shallow levels, such as the prompt or embedding level, and often fail to expose vulnerabilities rooted in deeper model components, which creates a false sense of security about how well defenses work. In this paper, we propose Safety Attention Head Attack (SAHA), an attention-head-level jailbreak framework that explores the vulnerability of deeper but insufficiently aligned attention heads. SAHA contains two novel designs. First, we reveal that deeper attention layers introduce more vulnerability to jailbreak attacks. Based on this finding, SAHA introduces an Ablation-Impact Ranking head selection strategy to effectively locate the most vital layer for unsafe output. Second, we introduce a boundary-aware perturbation method, Layer-Wise Perturbation, to probe the generation of unsafe content with minimal perturbation to the attention. This constrained perturbation guarantees higher semantic relevance to the target intent while ensuring evasion. Extensive experiments show the superiority of our method: SAHA improves attack success rate (ASR) by 14% over SOTA baselines, revealing the vulnerability of the attack surface on the attention head. Our code is available at https://anonymous.4open.science/r/SAHA.
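The abstract does not spell out how Ablation-Impact Ranking is computed, so the following is only a generic illustration of the idea of ranking attention heads by ablation impact, not the authors' implementation. It builds a toy single-layer multi-head self-attention with random weights, zeroes out one head at a time, and ranks heads by how much a scalar score changes. Every name here (`rank_heads_by_ablation_impact`, `probe_score`, the random `probe` direction, the toy dimensions) is a hypothetical stand-in chosen for this sketch.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        e = [math.exp(v - mx) for v in row]
        s = sum(e)
        out.append([v / s for v in e])
    return out

def head_output(x, wq, wk, wv, wo):
    """Scaled dot-product self-attention for one head, projected back to d_model."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    return matmul(matmul(softmax_rows(scores), v), wo)

def layer_output(x, heads, mask):
    """Sum the outputs of all heads whose mask entry is True (False = ablated)."""
    out = [[0.0] * len(x[0]) for _ in x]
    for keep, (wq, wk, wv, wo) in zip(mask, heads):
        if not keep:
            continue
        h = head_output(x, wq, wk, wv, wo)
        out = [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(out, h)]
    return out

def rank_heads_by_ablation_impact(x, heads, score_fn):
    """Return (head indices sorted by descending impact, per-head impacts)."""
    n = len(heads)
    baseline = score_fn(layer_output(x, heads, [True] * n))
    impacts = []
    for h in range(n):
        mask = [i != h for i in range(n)]  # ablate only head h
        impacts.append(abs(baseline - score_fn(layer_output(x, heads, mask))))
    order = sorted(range(n), key=lambda h: -impacts[h])
    return order, impacts

# Toy setup: random inputs and weights, purely illustrative.
random.seed(0)

def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

SEQ, D_MODEL, D_HEAD, N_HEADS = 4, 8, 2, 4
x = rand_matrix(SEQ, D_MODEL)
heads = [
    (rand_matrix(D_MODEL, D_HEAD), rand_matrix(D_MODEL, D_HEAD),
     rand_matrix(D_MODEL, D_HEAD), rand_matrix(D_HEAD, D_MODEL))
    for _ in range(N_HEADS)
]
probe = [random.gauss(0, 1) for _ in range(D_MODEL)]  # hypothetical scoring direction

def probe_score(out):
    """Project the last token's output onto the probe direction."""
    return sum(a * b for a, b in zip(out[-1], probe))

order, impacts = rank_heads_by_ablation_impact(x, heads, probe_score)
print("heads ranked by ablation impact:", order)
```

In this toy, the score is a dot product with a random vector. A real study would have to substitute a measurement that is meaningful for the model under test, and the paper may define impact differently.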

