2602.08621v1 Feb 09, 2026 cs.LG

Sparse Models, Sparse Safety: Unsafe Routes in Mixture-of-Experts LLMs

Yukun Jiang (citations: 153, h-index: 7)
Hai Huang (citations: 102, h-index: 3)
Mingjie Li (citations: 119, h-index: 6)
Yage Zhang (citations: 21, h-index: 3)
Michael Backes (citations: 1,142, h-index: 18)
Yang Zhang (citations: 777, h-index: 14)

Original Abstract

By introducing routers to selectively activate experts in Transformer layers, the mixture-of-experts (MoE) architecture significantly reduces computational costs in large language models (LLMs) while maintaining competitive performance, especially for models with massive parameters. However, prior work has largely focused on utility and efficiency, leaving the safety risks associated with this sparse architecture underexplored. In this work, we show that the safety of MoE LLMs is as sparse as their architecture by discovering unsafe routes: routing configurations that, once activated, convert safe outputs into harmful ones. Specifically, we first introduce the Router Safety importance score (RoSais) to quantify the safety criticality of each layer's router. Manipulation of only the high-RoSais router(s) can flip the default route into an unsafe one. For instance, on JailbreakBench, masking 5 routers in DeepSeek-V2-Lite increases attack success rate (ASR) by over 4× to 0.79, highlighting an inherent risk that router manipulation may naturally occur in MoE LLMs. We further propose a Fine-grained token-layer-wise Stochastic Optimization framework to discover more concrete Unsafe Routes (F-SOUR), which explicitly considers the sequentiality and dynamics of input tokens. Across four representative MoE LLM families, F-SOUR achieves an average ASR of 0.90 and 0.98 on JailbreakBench and AdvBench, respectively. Finally, we outline defensive perspectives, including safety-aware route disabling and router training, as promising directions to safeguard MoE LLMs. We hope our work can inform future red-teaming and safeguarding of MoE LLMs. Our code is available at https://github.com/TrustAIRLab/UnsafeMoE.
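
For readers unfamiliar with the term, a "route" here is the set of experts a router activates for a token at a given layer. The sketch below illustrates generic top-k routing only; it is not taken from the paper or its repository. The function names (`softmax`, `top_k_route`), the example logits, and the choice to renormalize the selected weights are all assumptions made for illustration, and real MoE LLMs differ in these details.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Hypothetical helper: weights are renormalized to sum to 1, which is
    one common convention but not universal across MoE models.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


# Illustrative router scores for one token over four experts (made-up values).
logits = [0.1, 2.0, -1.0, 1.5]
route = top_k_route(logits, k=2)
print(sorted(route))  # indices of the activated experts
```

Because only k of the experts run for each token, changing which indices a router selects changes which parameters process that token, which is the sense in which the abstract describes routing configurations as routes.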

4 Citations

0 Influential

37.047189562171 Altmetric

189.2 Score
