2602.13547v1 Feb 14, 2026 cs.CR

AISA: Awakening Intrinsic Safety Awareness in Large Language Models against Jailbreak Attacks

Ruiping Yin

Xuan Xie

Wei Song

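Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, and many existing defenses rely on costly fine-tuning, prompt rewriting, or external guardrails, which add latency and can degrade helpfulness. Rather than treating safety as an add-on, this work proposes AISA, a lightweight, single-pass defense that activates safety behavior already latent in the model. AISA first localizes intrinsic safety awareness through spatiotemporal analysis, showing that intent-discriminative signals are broadly encoded and that separability is especially strong in the scaled dot-product outputs of specific attention heads just before generation. Using a small, automatically selected set of heads, AISA extracts an interpretable prompt-risk score with almost no overhead, reaching detection performance on small 7B-parameter models that is competitive with strong proprietary models. AISA then steers at the logit level: it modulates the decoding distribution in proportion to the inferred risk, generating normally for benign prompts and issuing an appropriate refusal for high-risk requests, without changing model parameters, adding auxiliary modules, or requiring multi-pass inference. Extensive experiments across 13 datasets, 12 LLMs, and 14 baselines show that AISA improves robustness and generalization while preserving helpfulness and reducing false refusals, enabling safe deployment even for weakly aligned or intentionally risky model variants.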

Original Abstract

Large language models (LLMs) remain vulnerable to jailbreak prompts that elicit harmful or policy-violating outputs, while many existing defenses rely on expensive fine-tuning, intrusive prompt rewriting, or external guardrails that add latency and can degrade helpfulness. We present AISA, a lightweight, single-pass defense that activates safety behaviors already latent inside the model rather than treating safety as an add-on. AISA first localizes intrinsic safety awareness via spatiotemporal analysis and shows that intent-discriminative signals are broadly encoded, with especially strong separability appearing in the scaled dot-product outputs of specific attention heads near the final structural tokens before generation. Using a compact set of automatically selected heads, AISA extracts an interpretable prompt-risk score with minimal overhead, achieving detector-level performance competitive with strong proprietary baselines on small (7B) models. AISA then performs logits-level steering: it modulates the decoding distribution in proportion to the inferred risk, ranging from normal generation for benign prompts to calibrated refusal for high-risk requests -- without changing model parameters, adding auxiliary modules, or requiring multi-pass inference. Extensive experiments spanning 13 datasets, 12 LLMs, and 14 baselines demonstrate that AISA improves robustness and transfer while preserving utility and reducing false refusals, enabling safer deployment even for weakly aligned or intentionally risky model variants.
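The abstract describes two steps: deriving a prompt-risk score from a few attention heads, then shifting the decoding distribution in proportion to that score. The sketch below is only an illustration of that general idea, not the paper's actual method. It assumes a logistic probe over pooled head features and an additive bias on refusal-token logits; the function names, the `strength` parameter, the probe weights, and the toy vocabulary are all hypothetical, since the abstract does not specify them.

```python
import math


def risk_score(head_features, weights, bias):
    """Hypothetical probe: logistic regression over features pooled
    from selected attention heads. Returns a risk value in (0, 1)."""
    z = sum(w * x for w, x in zip(weights, head_features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def steer_logits(logits, refusal_ids, risk, strength=10.0):
    """Assumed steering rule: raise refusal-token logits by an amount
    proportional to the risk score. The input list is not modified."""
    steered = list(logits)
    for token_id in refusal_ids:
        steered[token_id] += strength * risk
    return steered


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Toy 3-token vocabulary; token 2 stands in for a refusal opener.
base_logits = [2.0, 1.0, 0.0]
refusal_ids = [2]
probe_weights, probe_bias = [3.0, 3.0], -4.0  # made-up probe parameters

benign_risk = risk_score([0.1, -0.2], probe_weights, probe_bias)
risky_risk = risk_score([1.5, 1.2], probe_weights, probe_bias)

benign_probs = softmax(steer_logits(base_logits, refusal_ids, benign_risk))
risky_probs = softmax(steer_logits(base_logits, refusal_ids, risky_risk))

print("benign prompt, most likely token:", argmax(benign_probs))
print("risky prompt, most likely token:", argmax(risky_probs))
```

With these made-up numbers, a low risk score leaves the ranking of next-token candidates unchanged, while a high score makes the refusal token the most likely one. Because the bias scales continuously with the score, intermediate risk values shift probability toward refusal gradually instead of flipping a hard switch, which is one plausible reading of "in proportion to the inferred risk".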
