2604.25716v1 Apr 28, 2026 cs.CL

Cross-Lingual Jailbreak Detection via Semantic Codebooks

Sabrina Sadiekh

Shirin Alanova

Bogdan Minko

Evgeniy Kokuykin

Abstract

Safety mechanisms for large language models (LLMs) remain predominantly English-centric, creating systematic vulnerabilities in multilingual deployment. Prior work shows that translating malicious prompts into other languages can substantially increase jailbreak success rates, exposing a structural cross-lingual security gap. We investigate whether such attacks can be mitigated through language-agnostic semantic similarity, without retraining or language-specific adaptation. Our approach compares multilingual query embeddings against a fixed English codebook of jailbreak prompts and operates as a training-free external guardrail for black-box LLMs. We conduct a systematic evaluation across four languages, two translation pipelines, four safety benchmarks, three embedding models, and three target LLMs (Qwen, Llama, GPT-3.5). Our results reveal two distinct regimes of cross-lingual transfer. On curated benchmarks containing canonical jailbreak templates, semantic similarity generalizes reliably across languages, achieving near-perfect separability (AUC up to 0.99) and substantial reductions in absolute attack success rates under strict low-false-positive constraints. Under distribution shift, however, on behaviorally diverse and heterogeneous unsafe benchmarks, separability degrades markedly (AUC approximately 0.60-0.70), and recall in the security-critical low-FPR regime drops across all embedding models.
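The codebook comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not give the scoring rule or the calibration procedure, so the choices here are assumptions: cosine similarity, a maximum over codebook entries, and a threshold calibrated on benign queries to respect a target false-positive rate. All function names are hypothetical, and random vectors stand in for the output of a multilingual embedding model.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    x = np.asarray(x, dtype=float)
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def codebook_scores(query_emb, codebook_emb):
    """Score each query by its highest cosine similarity to any codebook entry.

    Assumption: max-over-codebook cosine similarity; the paper may aggregate
    differently.
    """
    q = l2_normalize(query_emb)
    c = l2_normalize(codebook_emb)
    return (q @ c.T).max(axis=1)


def threshold_at_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    Assumption: calibration on a held-out benign set; a query is flagged
    only when its score is strictly greater than the threshold.
    """
    s = np.sort(np.asarray(benign_scores, dtype=float))
    n = len(s)
    allowed = min(int(np.floor(target_fpr * n)), n - 1)  # tolerated false positives
    return float(s[n - 1 - allowed])


def flag_queries(scores, threshold):
    """Return a boolean mask of queries the guardrail would block."""
    return np.asarray(scores) > threshold


# Toy data: random vectors stand in for real multilingual embeddings.
rng = np.random.default_rng(0)
dim = 64
codebook = rng.normal(size=(5, dim))            # fixed English jailbreak codebook
benign = rng.normal(size=(200, dim))            # benign queries, any language
# Translated jailbreaks are modeled as small perturbations of codebook entries.
attacks = codebook[rng.integers(0, 5, size=50)] + 0.05 * rng.normal(size=(50, dim))

tau = threshold_at_fpr(codebook_scores(benign, codebook), target_fpr=0.01)
blocked = flag_queries(codebook_scores(attacks, codebook), tau)
print("threshold:", tau, "fraction of attacks blocked:", blocked.mean())
```

In this toy setup the attacks sit close to codebook entries by construction, which resembles the canonical-template regime. The distribution-shift regime corresponds to unsafe queries whose embeddings lie far from every codebook entry, where no threshold separates them from benign traffic at a low false-positive rate.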