2605.05662v1 May 07, 2026 cs.CL

XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity

Brigitta Jesica Kartono

Citations: 3

h-index: 1

Helena Berndt

Citations: 3

h-index: 1

Haon Park

Citations: 347

h-index: 2

Amanda Minnich

Citations: 126

h-index: 7

Dasol Choi

Citations: 5

h-index: 1

Eugenia Kim

Citations: 34

h-index: 2

Jae-won Noh

Citations: 0

h-index: 0

Sanghyun Seo

Citations: 22

h-index: 2

Eunmi Kim

Citations: 1

h-index: 1

M. Oh

Citations: 38

h-index: 3

Yunjin Park

Citations: 0

h-index: 0

Josef Pichlmeier

Citations: 221

h-index: 5

Sai Krishna Mendu

Citations: 12

h-index: 2

Glenn Johannes Tungka

Citations: 0

h-index: 0

Ozlem Gokcce

Citations: 0

h-index: 0

Suresh Gehlot

Citations: 0

h-index: 0

K. Pratt

Citations: 0

h-index: 0

Abstract

Current LLM safety benchmarks are predominantly English-centric and often rely on translation, so they fail to capture country-specific harms. Moreover, they rarely evaluate a model's ability to detect culturally embedded sensitivities as distinct from universal harms. We introduce XL-SafetyBench, a suite of 5,500 test cases across 10 country-language pairs, comprising a Jailbreak Benchmark of country-grounded adversarial prompts and a Cultural Benchmark in which local sensitivities are embedded within innocuous requests. Each item is constructed via a multi-stage pipeline that combines LLM-assisted discovery, automated validation gates, and two independent native-speaker annotators per country. To distinguish principled refusal from comprehension failure, we evaluate Attack Success Rate (ASR) alongside two complementary metrics we introduce: Neutral-Safe Rate (NSR) and Cultural Sensitivity Rate (CSR). Evaluating 10 frontier and 27 local LLMs reveals two key findings. First, jailbreak robustness and cultural awareness are not coupled among frontier models, so a composite safety score obscures per-axis variation. Second, local models exhibit a near-linear ASR-NSR trade-off (r = -0.81), indicating that their apparent safety reflects generation failure rather than genuine alignment. XL-SafetyBench enables more nuanced, cross-cultural safety evaluation in the multilingual era.
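The ASR-NSR relationship reported above can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: that each model response receives exactly one judge label, that ASR and NSR are simple per-model fractions of those labels, and that r is a Pearson correlation across models. The label names (`attack_success`, `neutral_safe`, `other`), the model names, and all counts are hypothetical placeholders, not data from the paper.

```python
from collections import Counter
from math import sqrt


def rate(labels, target):
    """Fraction of labels equal to `target`; 0.0 for an empty list."""
    if not labels:
        return 0.0
    return Counter(labels)[target] / len(labels)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Synthetic judge labels per model (hypothetical, for illustration only).
judged = {
    "model_a": ["attack_success"] * 6 + ["neutral_safe"] * 3 + ["other"] * 1,
    "model_b": ["attack_success"] * 3 + ["neutral_safe"] * 5 + ["other"] * 2,
    "model_c": ["attack_success"] * 1 + ["neutral_safe"] * 8 + ["other"] * 1,
}

# Assumed definitions: ASR and NSR as plain label fractions per model.
asr = [rate(labels, "attack_success") for labels in judged.values()]
nsr = [rate(labels, "neutral_safe") for labels in judged.values()]

r = pearson_r(asr, nsr)
print(f"ASR per model: {asr}")
print(f"NSR per model: {nsr}")
print(f"Pearson r(ASR, NSR) = {r:.2f}")
```

A strongly negative r in a computation like this means that models with lower attack success also give more neutral-safe responses. Reporting the two rates separately keeps that trade-off visible, which a single composite score would hide.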

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
