2604.16824v1 Apr 18, 2026 cs.CR

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Weikai Lin (Citations: 106, h-index: 5)

Yada Zhu (Citations: 40, h-index: 3)

Song Wang (Citations: 9, h-index: 1)

Bo Yan (Citations: 8, h-index: 1)

Original Abstract

Multi-turn jailbreak attacks progressively erode LLM safety alignment across seemingly innocuous conversation turns, achieving success rates exceeding 90% against state-of-the-art models. Existing alignment-based and guardrail methods suffer from three key limitations: they require costly weight modification, evaluate each turn independently without modeling cumulative safety erosion, and detect attacks only after harmful content has been generated. To address these limitations, we first formulate the proactive early jailbreak detection problem with a new metric, detection lead, that measures how early an attack can be detected before the LLM complies. We then propose SAFEDREAM, a lightweight world-model-based framework that operates as an external module without modifying the LLM's weights. SAFEDREAM introduces three components: (1) a safety state world model that encodes LLM hidden states into a compact safety representation and predicts how it evolves across turns, (2) CUSUM detection that accumulates weak per-turn risk signals into reliable evidence, and (3) contrastive imagination that simultaneously rolls out attack and benign futures in latent space to issue early alarms before jailbreaks occur. On three multi-turn jailbreak benchmarks (XGuard-Train, SafeDialBench, SafeMTData) against 8 baselines, SAFEDREAM achieves the best detection timeliness across all benchmarks (1.06-1.20 turns before compliance) while maintaining competitive false positive rates and outperforming baselines in detection quality.
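
To make the CUSUM and detection-lead ideas concrete, here is a minimal sketch and not the authors' implementation. It assumes a standard one-sided CUSUM statistic over per-turn risk scores, and it assumes detection lead is the compliance turn minus the alarm turn. The names `cusum_alarm_turn`, `detection_lead`, `drift`, and `threshold` are hypothetical, as are the risk values; the abstract does not specify how SAFEDREAM computes its risk signals.

```python
from typing import Optional, Sequence


def cusum_alarm_turn(
    risks: Sequence[float], drift: float, threshold: float
) -> Optional[int]:
    """Return the 1-indexed turn where a one-sided CUSUM statistic
    first reaches `threshold`, or None if it never does.

    Assumption: per-turn risk scores are given; `drift` is the
    allowance subtracted each turn so benign noise decays to zero.
    """
    stat = 0.0
    for turn, risk in enumerate(risks, start=1):
        # Accumulate evidence above the drift allowance; clamp at zero.
        stat = max(0.0, stat + risk - drift)
        if stat >= threshold:
            return turn
    return None


def detection_lead(
    alarm_turn: Optional[int], compliance_turn: int
) -> Optional[int]:
    """Assumed definition: turns between the alarm and the turn at
    which the LLM complies. Positive means the alarm came first."""
    if alarm_turn is None:
        return None
    return compliance_turn - alarm_turn


# Hypothetical per-turn risk scores for one escalating conversation.
risks = [0.1, 0.3, 0.4, 0.5, 0.6]
alarm = cusum_alarm_turn(risks, drift=0.2, threshold=0.5)
lead = detection_lead(alarm, compliance_turn=5)
print(alarm, lead)
```

In this toy conversation the alarm fires at turn 4, one turn before the assumed compliance at turn 5. The accumulated statistic crosses the threshold even though the first three per-turn scores are each below it, which is the point of accumulating weak signals.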
