2602.01587v1 Feb 02, 2026 cs.CL

A Defense Framework Against LLM Jailbreak Attacks via Noise-Augmented Alignment: Securing Provable Safety

Provable Defense Framework for LLM Jailbreaks via Noise-Augmented Alignment

Jianwei Yang

Citations: 1,653

h-index: 7

Zehua Cheng

Citations: 26

h-index: 3

Wei Dai

Citations: 15

h-index: 2

Jiahao Sun

Citations: 15

h-index: 2

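Large language models (LLMs) remain vulnerable to adaptive jailbreak attacks, such as GCG, that easily bypass empirical defenses. This work proposes a framework for certifiable robustness that shifts the safety guarantee from a single inference pass to the statistical stability of an ensemble. It introduces Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, which partitions the input into an immutable structural prompt and a mutable payload so that rigorous $\ell_0$-norm guarantees can be derived from the hypergeometric distribution. To address performance degradation on sparse contexts, it uses Noise-Augmented Alignment Tuning (NAAT), which turns the base model into a semantic denoiser. In extensive experiments on Llama-3, the method reduces the success rate of gradient-based attacks from 84.2% to 1.2% while retaining 94.1% benign utility, whereas character-level baselines degrade utility to 74.3%. The framework provides a deterministic certificate of safety, guaranteeing that the model remains robust to every adversarial variant within a provable radius.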

Original Abstract

Large Language Models (LLMs) remain vulnerable to adaptive jailbreaks that easily bypass empirical defenses like GCG. We propose a framework for certifiable robustness that shifts safety guarantees from single-pass inference to the statistical stability of an ensemble. We introduce Certified Semantic Smoothing (CSS) via Stratified Randomized Ablation, a technique that partitions inputs into immutable structural prompts and mutable payloads to derive rigorous $\ell_0$ norm guarantees using the Hypergeometric distribution. To resolve performance degradation on sparse contexts, we employ Noise-Augmented Alignment Tuning (NAAT), which transforms the base model into a semantic denoiser. Extensive experiments on Llama-3 show that our method reduces the Attack Success Rate of gradient-based attacks from 84.2% to 1.2% while maintaining 94.1% benign utility, significantly outperforming character-level baselines which degrade utility to 74.3%. This framework provides a deterministic certificate of safety, ensuring that a model remains robust against all adversarial variants within a provable radius.
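
The abstract does not spell out the certificate itself, so the following is only a minimal sketch of how a hypergeometric bound of this kind is commonly computed for randomized ablation. It assumes that the structural prompt is always kept, that `k_keep` of the `n_payload` payload tokens are retained uniformly at random, and that the ensemble takes a binary safe/unsafe vote with a 0.5 threshold. The function names, the parameters, and the threshold are illustrative assumptions, not the paper's API, and the paper's exact bound may differ.

```python
from math import comb


def overlap_probability(n_payload: int, k_keep: int, rho: int) -> float:
    """Probability that at least one of `rho` perturbed payload tokens
    survives when `k_keep` of `n_payload` payload tokens are retained.

    The number of perturbed tokens in the retained subset follows a
    hypergeometric distribution, so P(none retained) = C(n-rho, k) / C(n, k).
    """
    if not (0 <= k_keep <= n_payload and 0 <= rho <= n_payload):
        raise ValueError("k_keep and rho must lie in [0, n_payload]")
    # math.comb returns 0 when fewer than k_keep clean tokens remain.
    return 1.0 - comb(n_payload - rho, k_keep) / comb(n_payload, k_keep)


def certified_radius(p_safe_lower: float, n_payload: int, k_keep: int) -> int:
    """Largest rho for which the 'safe' vote provably stays above 0.5.

    `p_safe_lower` is assumed to be a lower confidence bound on the fraction
    of ablated samples classified safe. Perturbing rho tokens can change the
    vote on at most an overlap_probability fraction of the samples.
    """
    radius = 0
    for rho in range(1, n_payload + 1):
        if p_safe_lower - overlap_probability(n_payload, k_keep, rho) > 0.5:
            radius = rho
        else:
            break  # the overlap probability only grows with rho
    return radius


if __name__ == "__main__":
    # Hypothetical numbers: 100 payload tokens, 5 kept per ablated sample.
    print(overlap_probability(100, 5, 1))
    print(certified_radius(0.95, 100, 5))
```

Under these assumptions, retaining fewer payload tokens enlarges the certified radius but leaves the model with sparser context, which is the degradation the abstract says NAAT is meant to counteract.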

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
