2602.11096v1 Feb 11, 2026 cs.CL

Safety Recovery in Reasoning Models Is Only a Few Early Steering Steps Away

Soumya Suvra Ghosal, National Institute of Technology Durgapur (Citations: 330, h-index: 11)
Souradip Chakraborty (Citations: 1,204, h-index: 20)
Vaibhav Singh (Citations: 31, h-index: 2)
Furong Huang (Citations: 518, h-index: 11)
Dinesh Manocha (Citations: 126, h-index: 5)
A. S. Bedi (Citations: 2,253, h-index: 26)

Abstract

Reinforcement learning (RL) based post-training for explicit chain-of-thought reasoning (e.g., GRPO) improves the reasoning ability of multimodal large-scale reasoning models (MLRMs). However, recent evidence shows that it can simultaneously degrade safety alignment and increase jailbreak success rates. We propose SafeThink, a lightweight inference-time defense that treats safety recovery as a satisficing constraint rather than a maximization objective. SafeThink monitors the evolving reasoning trace with a safety reward model and conditionally injects an optimized short corrective prefix ("Wait, think safely") only when the safety threshold is violated. In our evaluations across six open-source MLRMs and four jailbreak benchmarks (JailbreakV-28K, Hades, FigStep, and MM-SafetyBench), SafeThink reduces attack success rates by 30-60% (e.g., LlamaV-o1: 63.33% to 5.74% on JailbreakV-28K; R1-Onevision: 69.07% to 5.65% on Hades) while preserving reasoning performance (MathVista accuracy: 65.20% to 65.00%). A key empirical finding from our experiments is that safety recovery is often only a few steering steps away: intervening in the first 1-3 reasoning steps typically suffices to redirect the full generation toward safe completions.
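The conditional steering loop described in the abstract might be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold of 0.5, the step budget, and the choice to discard and regenerate a flagged step are all assumptions, and the generator and safety scorer are toy stand-ins for an MLRM and a safety reward model.

```python
from typing import Callable, List

# Hypothetical interfaces; the abstract does not specify an API.
StepGenerator = Callable[[str], str]   # context -> next reasoning step
SafetyScorer = Callable[[str], float]  # text -> safety score in [0, 1]


def steer_safely(
    prompt: str,
    generate_step: StepGenerator,
    safety_score: SafetyScorer,
    threshold: float = 0.5,             # assumed value
    steer_prefix: str = "Wait, think safely",
    max_steps: int = 8,                 # assumed value
    max_steered_steps: int = 3,         # "first 1-3 reasoning steps"
) -> List[str]:
    """Generate reasoning steps, steering only when an early step is unsafe."""
    steps: List[str] = []
    for i in range(max_steps):
        context = prompt + "".join(steps)
        step = generate_step(context)
        # Satisficing check: intervene only if the score falls below the
        # threshold; no attempt is made to maximize the safety score.
        if i < max_steered_steps and safety_score(context + step) < threshold:
            # Assumption: drop the flagged step and regenerate it
            # conditioned on the corrective prefix.
            steered_context = context + steer_prefix + " "
            step = steer_prefix + " " + generate_step(steered_context)
        steps.append(step)
    return steps


# Toy stand-ins for the model and the safety reward model.
def toy_generate(context: str) -> str:
    return "safe step\n" if "Wait, think safely" in context else "risky step\n"


def toy_score(text: str) -> float:
    return 0.0 if text.endswith("risky step\n") else 1.0


trace = steer_safely("Q: example prompt\n", toy_generate, toy_score)
interventions = sum(s.startswith("Wait, think safely") for s in trace)
```

In this toy run a single intervention at the first step is enough to keep every later step above the threshold, which mirrors the abstract's observation that steering is typically needed only in the first few reasoning steps.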

