2602.04448v1 Feb 04, 2026 cs.LG

RASA: Routing-Aware Safety Alignment for Mixture-of-Experts Models

Tanqiu Jiang

Citations: 74

h-index: 4

Yuhui Wang

Citations: 240

h-index: 6

Jiacheng Liang

Citations: 138

h-index: 7

Ting Wang

Citations: 60

h-index: 4

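Mixture-of-Experts (MoE) language models pose distinct challenges for safety alignment because their sparse routing mechanisms can induce degenerate optimization behavior under standard full-parameter fine-tuning. In our preliminary experiments, naively applying full-parameter safety fine-tuning to MoE models lowers attack success rates through routing or expert-dominance effects rather than by directly repairing Safety-Critical Experts. To address this, we propose RASA, a routing-aware, expert-level alignment framework that explicitly repairs Safety-Critical Experts and prevents routing-based bypasses. RASA identifies the experts that are disproportionately activated by successful jailbreaks, selectively fine-tunes only those experts with routing held fixed, and then enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a range of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits more from targeted expert repair than from global parameter updates, offering a practical, architecture-preserving alternative to prior approaches.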

Original Abstract

Mixture-of-Experts (MoE) language models introduce unique challenges for safety alignment due to their sparse routing mechanisms, which can enable degenerate optimization behaviors under standard full-parameter fine-tuning. In our preliminary experiments, we observe that naively applying full-parameter safety fine-tuning to MoE models can reduce attack success rates through routing or expert dominance effects, rather than by directly repairing Safety-Critical Experts. To address this challenge, we propose RASA, a routing-aware expert-level alignment framework that explicitly repairs Safety-Critical Experts while preventing routing-based bypasses. RASA identifies experts disproportionately activated by successful jailbreaks, selectively fine-tunes only these experts under fixed routing, and subsequently enforces routing consistency with safety-aligned contexts. Across two representative MoE architectures and a diverse set of jailbreak attacks, RASA achieves near-perfect robustness, strong cross-attack generalization, and substantially reduced over-refusal, while preserving general capabilities on benchmarks such as MMLU, GSM8K, and TruthfulQA. Our results suggest that robust MoE safety alignment benefits from targeted expert repair rather than global parameter updates, offering a practical and architecture-preserving alternative to prior approaches.
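
The abstract does not spell out how "disproportionately activated" experts are scored, so the following is only a minimal sketch under stated assumptions: routing decisions are available as per-token tuples of selected expert indices, and an expert counts as safety-critical when its activation rate on successful jailbreak prompts exceeds its rate on benign prompts by a fixed ratio. The function names, the ratio threshold, and the toy traces are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def expert_activation_rates(routing_traces, num_experts):
    """Return, for each expert, the fraction of tokens routed to it.

    routing_traces: list of prompts; each prompt is a list of per-token
    tuples holding the expert indices chosen by the router (top-k).
    """
    counts = Counter()
    total_tokens = 0
    for prompt in routing_traces:
        for token_experts in prompt:
            counts.update(token_experts)
            total_tokens += 1
    if total_tokens == 0:
        return [0.0] * num_experts
    return [counts[e] / total_tokens for e in range(num_experts)]


def select_safety_critical_experts(jailbreak_traces, benign_traces,
                                   num_experts, ratio_threshold=2.0,
                                   eps=1e-6):
    """Pick experts over-activated on jailbreaks (assumed ratio criterion)."""
    jb = expert_activation_rates(jailbreak_traces, num_experts)
    bn = expert_activation_rates(benign_traces, num_experts)
    return [e for e in range(num_experts)
            if jb[e] / (bn[e] + eps) >= ratio_threshold]


def trainable_mask(num_experts, selected):
    """True for experts to fine-tune; every other expert stays frozen."""
    chosen = set(selected)
    return [e in chosen for e in range(num_experts)]


# Toy traces for a 4-expert, top-2 router (made-up data for illustration).
jailbreak_traces = [[(0, 3), (1, 3), (2, 3)], [(0, 3), (3, 1)]]
benign_traces = [[(0, 1), (0, 2), (1, 2)], [(0, 1), (2, 3)]]

selected = select_safety_critical_experts(jailbreak_traces, benign_traces, 4)
mask = trainable_mask(4, selected)
print(selected, mask)
```

In a real training loop, a mask like this would decide which expert parameters receive gradient updates while the router weights are kept frozen, which is one way to read "selectively fine-tunes only these experts under fixed routing". The later routing-consistency step is not sketched here because the abstract gives no detail on its objective.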
