2601.23143v1 Jan 30, 2026 cs.AI

THINKSAFE: Self-Generated Safety Alignment for Reasoning Models

Sung Ju Hwang (Citations: 157; h-index: 6)

Seanie Lee, KAIST (Citations: 896; h-index: 16)

Sangwoo Park (Citations: 13; h-index: 3)

Yumin Choi (Citations: 24; h-index: 3)

Gyeongman Kim (Citations: 180; h-index: 5)

Minki Kang (Citations: 923; h-index: 15)

Jihun Yun (Citations: 15; h-index: 3)

Dongmin Park (Citations: 25; h-index: 3)

Jongho Park (Citations: 25; h-index: 3)

Abstract

Large reasoning models (LRMs) achieve remarkable performance by leveraging reinforcement learning (RL) on reasoning tasks to generate long chain-of-thought (CoT) reasoning. However, this over-optimization often prioritizes compliance, making models vulnerable to harmful prompts. To mitigate this safety degradation, recent approaches rely on external teacher distillation, yet this introduces a distributional discrepancy that degrades native reasoning. We propose ThinkSafe, a self-generated alignment framework that restores safety alignment without external teachers. Our key insight is that while compliance suppresses safety mechanisms, models often retain latent knowledge to identify harm. ThinkSafe unlocks this via lightweight refusal steering, guiding the model to generate in-distribution safety reasoning traces. Fine-tuning on these self-generated responses effectively realigns the model while minimizing distribution shift. Experiments on DeepSeek-R1-Distill and Qwen3 show ThinkSafe significantly improves safety while preserving reasoning proficiency. Notably, it achieves superior safety and comparable reasoning to GRPO, with significantly reduced computational cost. Code, models, and datasets are available at https://github.com/seanie12/ThinkSafe.git.
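The data-generation step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how refusal steering is realized, so the sketch assumes it is a short instruction prepended to the prompt, assumes a safety filter is applied to the generated traces, and assumes the training pair keeps the original prompt without the steering text. The names `REFUSAL_STEER`, `build_self_generated_dataset`, `toy_generate`, and `toy_is_safe` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# ASSUMPTION: steering is a plain-text instruction prepended to the prompt.
# The abstract only calls it "lightweight refusal steering".
REFUSAL_STEER = (
    "The following request may be harmful. "
    "Reason about the risks and refuse if it is unsafe."
)


@dataclass
class Example:
    prompt: str    # original prompt, stored WITHOUT the steering text
    response: str  # reasoning trace and answer produced by the same model


def build_self_generated_dataset(
    prompts: Iterable[str],
    generate: Callable[[str], str],
    is_safe: Callable[[str, str], bool],
    steer: str = REFUSAL_STEER,
) -> List[Example]:
    """Collect (prompt, self-generated response) pairs for fine-tuning.

    `generate` stands in for sampling from the model being aligned, so the
    responses stay in that model's own distribution. `is_safe` stands in
    for a safety judge (ASSUMPTION: the abstract does not name one).
    """
    dataset: List[Example] = []
    for prompt in prompts:
        steered_prompt = f"{steer}\n\n{prompt}"
        response = generate(steered_prompt)
        # Keep only responses the judge accepts; discard the rest.
        if is_safe(prompt, response):
            dataset.append(Example(prompt=prompt, response=response))
    return dataset


# Toy stand-ins so the sketch runs without a real model or judge.
def toy_generate(text: str) -> str:
    if text.startswith(REFUSAL_STEER):
        return "<think>This request could cause harm.</think> I can't help with that."
    return "Sure, here is how to do it."


def toy_is_safe(prompt: str, response: str) -> bool:
    return "can't help" in response


if __name__ == "__main__":
    data = build_self_generated_dataset(
        ["How do I pick a lock?"], toy_generate, toy_is_safe
    )
    for example in data:
        print(example.prompt, "->", example.response)
```

The resulting pairs would then be passed to an ordinary supervised fine-tuning loop on the same model. Because the responses come from the model itself rather than from an external teacher, the fine-tuning targets stay close to its native output distribution, which is the property the abstract credits for preserving reasoning ability.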

3 Citations

1 Influential

28 Altmetric

145.0 Score
