2603.16210v1 Mar 17, 2026 cs.AI

MOSAIC: Composable Safety Alignment with Modular Control Tokens

Zhuoran Li (Citations: 0, h-index: 0)
Xiangyu Zhao (Citations: 2, h-index: 1)
Hongyu Chen (Citations: 25, h-index: 3)
Jiancheng Dong (Citations: 3, h-index: 1)
Wenxiao Li (Citations: 4, h-index: 1)

Abstract

Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
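To make the idea of composable control tokens concrete, here is a minimal sketch of inference-time composition only. It is not the authors' implementation: the abstract does not specify where the tokens are placed, how they are trained, or how they are named. The sketch assumes each constraint is one learned embedding vector prepended to the prompt embeddings, and every name in it (`init_control_tokens`, `compose_prefix`, `build_model_input`, the constraint labels) is hypothetical.

```python
import random
from typing import Dict, List, Sequence

Vector = List[float]


def init_control_tokens(constraints: Sequence[str], dim: int, seed: int = 0) -> Dict[str, Vector]:
    """Create one small random embedding per safety constraint.

    Assumption: these vectors are the only trainable state; the backbone
    model stays frozen and is not represented here.
    """
    rng = random.Random(seed)
    return {name: [rng.gauss(0.0, 0.02) for _ in range(dim)] for name in constraints}


def compose_prefix(tokens: Dict[str, Vector], active: Sequence[str]) -> List[Vector]:
    """Select the embeddings of the active constraints, in the order given."""
    missing = [name for name in active if name not in tokens]
    if missing:
        raise KeyError(f"unknown constraints: {missing}")
    return [tokens[name] for name in active]


def build_model_input(prefix: List[Vector], prompt_embeddings: List[Vector]) -> List[Vector]:
    """Prepend the control-token embeddings to the prompt embeddings.

    Assumption: prefix placement; the abstract does not state the position.
    """
    return list(prefix) + list(prompt_embeddings)


# Hypothetical constraint labels, chosen only for illustration.
tokens = init_control_tokens(["no_weapons", "no_medical_advice", "no_gambling"], dim=4)
prompt = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]

# Activate two of the three constraints for this request.
inputs = build_model_input(compose_prefix(tokens, ["no_weapons", "no_gambling"]), prompt)
print(len(inputs))  # 2 control tokens + 2 prompt positions = 4
```

Under these assumptions, switching a policy on or off for a given user, region, or application only changes which vectors are selected, so the backbone parameters never need to be modified.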

