2601.03868v2 Jan 07, 2026 cs.CL

What Matters For Safety Alignment?

Authors

Hui-Ling Zhen (citations: 304, h-index: 6)
Xing Li (citations: 99, h-index: 4)
Mingxuan Yuan (citations: 139, h-index: 5)
Xianzhi Yu (citations: 46, h-index: 4)
Lihao Yin (citations: 21, h-index: 4)
Zhenhua Dong (citations: 2, h-index: 1)

Abstract

This paper presents a comprehensive empirical study of safety alignment capabilities. We evaluate what matters for safety alignment in large language models (LLMs) and large reasoning models (LRMs) to provide insights for developing more secure and reliable AI systems. We systematically investigate and compare the influence of six critical intrinsic model characteristics and three external attack techniques. Our large-scale evaluation covers 32 recent, popular LLMs and LRMs across thirteen distinct model families, with parameter counts ranging from 3B to 235B. The assessment uses five established safety datasets and probes model vulnerabilities with 56 jailbreak techniques and four chain-of-thought (CoT) attack strategies, for a total of 4.6M API calls. Our key empirical findings are fourfold.

First, we identify the LRMs GPT-OSS-20B, Qwen3-Next-80B-A3B-Thinking, and GPT-OSS-120B as the three safest models, which substantiates the significant advantage of integrated reasoning and self-reflection mechanisms for robust safety alignment.

Second, post-training and knowledge distillation may lead to a systematic degradation of safety alignment. We therefore argue that safety must be treated as an explicit constraint or a core optimization objective during these stages, not merely subordinated to the pursuit of general capability.

Third, we reveal a pronounced vulnerability: a CoT attack delivered via a response prefix raises the attack success rate by 3.34x on average, and from 0.6% to 96.3% for Seed-OSS-36B-Instruct. This finding underscores the safety risks inherent in text-completion interfaces and in LLM service features that allow user-defined response prefixes, and it highlights an urgent need for architectural and deployment safeguards.

Fourth, roleplay, prompt injection, and gradient-based search for adversarial prompts are the predominant methods for eliciting unaligned behavior in modern models.
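To make the attack-success-rate figures in the third finding concrete, the following is a minimal Python sketch of how such a rate and its fold increase could be computed. The function names `attack_success_rate` and `fold_increase` and the toy list of judged outcomes are hypothetical and not taken from the paper; only the 0.6% and 96.3% values come from the abstract. The abstract does not say how the 3.34x average is aggregated across models, so the sketch does not attempt to reproduce it.

```python
def attack_success_rate(outcomes):
    """Fraction of attack attempts judged successful.

    `outcomes` is an iterable of booleans (True = the attack elicited an
    unsafe response). This judging format is an assumption for illustration.
    """
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return sum(outcomes) / len(outcomes)


def fold_increase(baseline_asr, attacked_asr):
    """Ratio of the attacked success rate to the baseline success rate."""
    if baseline_asr <= 0:
        raise ValueError("baseline rate must be positive to form a ratio")
    return attacked_asr / baseline_asr


# Toy, made-up outcomes: 2 of 4 attempts succeed.
toy_asr = attack_success_rate([True, False, False, True])

# Values quoted in the abstract for Seed-OSS-36B-Instruct (0.6% -> 96.3%).
seed_fold = fold_increase(0.006, 0.963)

print(f"toy ASR: {toy_asr:.2f}")
print(f"Seed-OSS-36B-Instruct fold increase: {seed_fold:.1f}x")
```

For this single model the ratio is roughly 160x, far above the reported 3.34x average, which suggests that susceptibility to the response-prefix attack varies widely between models.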
