2604.07835v1 Apr 09, 2026 cs.AI

Silencing the Guardrails: Inference-Time Jailbreaking via Dynamic Contextual Representation Ablation

Authors:

Changting Lin (citations: 216, h-index: 8)
Wenpeng Xing (citations: 205, h-index: 10)
Moran Fang (citations: 0, h-index: 0)
Meng Han (citations: 50, h-index: 4)
Guangtai Wang (citations: 32, h-index: 3)

Original Abstract

While Large Language Models (LLMs) have achieved remarkable performance, they remain vulnerable to jailbreak attacks that circumvent safety constraints. Existing strategies, ranging from heuristic prompt engineering to computationally intensive optimization, often face significant trade-offs between effectiveness and efficiency. In this work, we propose Contextual Representation Ablation (CRA), a novel inference-time intervention framework designed to dynamically silence model guardrails. Predicated on the geometric insight that refusal behaviors are mediated by specific low-rank subspaces within the model's hidden states, CRA identifies and suppresses these refusal-inducing activation patterns during decoding without requiring expensive parameter updates or training. Empirical evaluation across multiple safety-aligned open-source LLMs demonstrates that CRA significantly outperforms baselines. These results expose the intrinsic fragility of current alignment mechanisms, revealing that safety constraints can be surgically ablated from internal representations, and underscore the urgent need for more robust defenses that secure the model's latent space.
