2605.05678v1 May 07, 2026 cs.AI

Chain of Risk: Safety Failures in Large Reasoning Models and Mitigation via Adaptive Multi-Principle Steering

Yunhan Zhao (Citations: 259, h-index: 6)

Jian Hou (Citations: 25, h-index: 3)

Zhiwei Zhang (Citations: 73, h-index: 4)

Taoran Li (Citations: 5, h-index: 1)

Binghan Lu (Citations: 1, h-index: 1)

Bing Hu (Citations: 17, h-index: 2)

Yuexing Hao (Citations: 17, h-index: 3)

Xiaomin Li (Citations: 129, h-index: 6)

Zheyuan Deng (Citations: 4, h-index: 1)

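Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content can appear in the reasoning trace even when the final answer looks safe. This study tests whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both the reasoning stage and the final-answer stage against a rubric of twenty safety principles. Using prompts drawn from seven public harmfulness and jailbreak sources plus four out-of-distribution (OOD) sources, it evaluates 15 open-weight and API-based LRMs on 41K prompts per model. Reasoning traces consistently reveal more safety risk than final answers, most notably in two high-severity stage-wise failures: leak cases, in which unsafe reasoning precedes a safe-looking answer, and escape cases, in which benign-looking reasoning precedes an unsafe final answer. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. The study also proposes adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates a direction only when the current hidden state is closer to the unsafe centroid than to the safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average reduction in unsafe counts while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.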
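
The two stage-wise failure types can be made concrete with a small labelling function. The sketch below is an illustration only, not the paper's evaluation code. It follows the definitions in the original abstract (leak: unsafe reasoning before a safe-looking answer; escape: benign-looking reasoning before an unsafe answer) and assumes each stage has already been reduced to a single unsafe/safe boolean by some judge. The names `classify_trajectory` and `tally`, and the labels `both_safe` and `both_unsafe`, are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def classify_trajectory(reasoning_unsafe: bool, answer_unsafe: bool) -> str:
    """Label one reasoning-answer pair by the stage where unsafe content appears."""
    if reasoning_unsafe and not answer_unsafe:
        return "leak"         # unsafe reasoning precedes a safe-looking answer
    if answer_unsafe and not reasoning_unsafe:
        return "escape"       # benign-looking reasoning precedes an unsafe answer
    if reasoning_unsafe and answer_unsafe:
        return "both_unsafe"  # hypothetical label: both stages flagged
    return "both_safe"        # hypothetical label: neither stage flagged


def tally(pairs: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count labels over (reasoning_unsafe, answer_unsafe) pairs."""
    return Counter(classify_trajectory(r, a) for r, a in pairs)


if __name__ == "__main__":
    # Toy judgments, invented for illustration; not data from the paper.
    judgments = [(True, False), (False, True), (False, False), (True, True)]
    print(tally(judgments))
```

Scoring only the final answer would miss every `leak` case in such a tally, which is the gap the abstract describes.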

Original Abstract

Large reasoning models (LRMs) increasingly expose chain-of-thought-like reasoning for transparency, verification, and deliberate problem solving. This creates a safety blind spot: harmful or policy-violating content may appear in reasoning traces even when final answers appear safe. We test whether final-answer safety is a sufficient proxy for the full reasoning-answer trajectory by scoring both stages under a unified twenty-principle safety rubric. Using prompts from seven public harmfulness and jailbreak sources, plus four out-of-distribution (OOD) sources, we evaluate 15 open-weight and API-based LRMs across 41K prompts per model. Reasoning traces consistently reveal additional safety risks beyond final answers, especially in high-severity stage-wise failures: leak cases, where unsafe reasoning precedes a safe-looking answer, and escape cases, where benign-looking reasoning precedes an unsafe final response. Principle-level analysis shows that risk concentrates in misinformation, legal compliance, discrimination, physical harm, and psychological harm. We further propose adaptive multi-principle steering, a white-box test-time mitigation that learns one unsafe-to-safe activation direction per safety principle and activates only directions whose current hidden state is closer to the unsafe than safe centroid. On three steerable open reasoning models, adaptive steering reduces unsafe counts in both reasoning traces and final answers on held-out and OOD benchmarks. DeepSeek-R1-Qwen-7B achieves a 40.8% average unsafe-count reduction while retaining 97.7% macro-averaged accuracy on BBH, GSM8K, and MMLU. These results suggest that LRM safety should be evaluated and mitigated over the full exposed reasoning-answer trajectory, not only at the final-answer stage.
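
The gating rule in adaptive multi-principle steering (one direction per principle, applied only when the hidden state is closer to the unsafe centroid than to the safe one) can be sketched in a few lines. This is a minimal illustration under several assumptions that the abstract does not confirm: each direction is taken as the difference between the safe and unsafe centroids, closeness is Euclidean distance, steering is a scaled vector addition with a hypothetical strength `alpha`, and the gate is evaluated on the unsteered hidden state. All class and function names are made up for this example.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean(vectors: Sequence[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class PrincipleDirection:
    """Centroids and an unsafe-to-safe direction for one safety principle."""

    def __init__(self, unsafe_acts: Sequence[Vector], safe_acts: Sequence[Vector]):
        self.unsafe_centroid = mean(unsafe_acts)
        self.safe_centroid = mean(safe_acts)
        # Assumption: direction = safe centroid minus unsafe centroid.
        self.direction = [s - u for s, u in
                          zip(self.safe_centroid, self.unsafe_centroid)]

    def is_active(self, hidden: Vector) -> bool:
        """Gate: fire only if hidden is closer to the unsafe centroid."""
        return (distance(hidden, self.unsafe_centroid)
                < distance(hidden, self.safe_centroid))


def adaptive_steer(hidden: Vector,
                   principles: Dict[str, PrincipleDirection],
                   alpha: float = 1.0) -> Vector:
    """Add alpha * direction for every principle whose gate fires."""
    steered = list(hidden)
    for principle in principles.values():
        if principle.is_active(hidden):  # gate uses the unsteered state
            steered = [h + alpha * d
                       for h, d in zip(steered, principle.direction)]
    return steered
```

In a real model the hidden state would be a layer activation read and modified at inference time; the plain lists here only stand in for that. A hidden state that already sits nearer the safe centroid for every principle passes through unchanged, which is one plausible reading of how the method limits its effect on benign inputs.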

Citations: 1, Influential: 0, Altmetric: 3, Score: 16.0
