2602.13562v1 Feb 14, 2026 cs.CR

Mitigating the Safety-utility Trade-off in LLM Alignment via Adaptive Safe Context Learning

Yongcan Yu

Citations: 40

h-index: 3

Yanbo Wang

Citations: 41

h-index: 4

Minzheng Wang

Institute of Automation, Chinese Academy of Sciences

Citations: 304

h-index: 8

Jian Liang

Citations: 212

h-index: 7

Lu Wang

Citations: 312

h-index: 3

R. He

Citations: 1,317

h-index: 15

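Reasoning models have achieved remarkable success on complex reasoning tasks, but their growing capability calls for stringent safety measures. The core challenge in safety alignment is the inherent trade-off between safety and utility. Prevailing alignment strategies typically use context distillation to build CoT (Chain-of-Thought) training data that embeds explicit safety rules. This creates a rigid association between rule memorization and refusal, which inadvertently limits the model's reasoning. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework, which improves reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process in which the model autonomously decides when to consult safety rules and how to continue its reasoning. To counteract the preference for rule consultation that arises during reinforcement learning, we also introduce Inverse Frequency Policy Optimization (IFPO), which rebalances advantage estimates. By decoupling rule retrieval from subsequent reasoning, our method achieves higher overall performance than the baselines.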

Original Abstract

While reasoning models have achieved remarkable success in complex reasoning tasks, their increasing power necessitates stringent safety measures. For safety alignment, the core challenge lies in the inherent trade-off between safety and utility. However, prevailing alignment strategies typically construct CoT training data with explicit safety rules via context distillation. This approach inadvertently limits reasoning capabilities by creating a rigid association between rule memorization and refusal. To mitigate the safety-utility trade-off, we propose the Adaptive Safe Context Learning (ASCL) framework to improve the reasoning given proper context. ASCL formulates safety alignment as a multi-turn tool-use process, empowering the model to autonomously decide when to consult safety rules and how to generate the ongoing reasoning. Furthermore, to counteract the preference for rule consultation during RL, we introduce Inverse Frequency Policy Optimization (IFPO) to rebalance advantage estimates. By decoupling rule retrieval and subsequent reasoning, our method achieves higher overall performance compared to baselines.
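The abstract says IFPO "rebalances advantage estimates" to counteract a preference for rule consultation, but it does not give the formula. The sketch below is only one plausible reading, not the authors' method: it assumes group-normalized advantages (as in GRPO-style training) and scales each rollout's advantage by the inverse frequency of its behaviour (consulted the rules or not) within the sampled group. The function name `inverse_frequency_advantages`, its parameters, and the weighting scheme are all hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def inverse_frequency_advantages(rewards, used_rules, eps=1e-8):
    """Hypothetical sketch: reweight group-normalized advantages by the
    inverse frequency of each rollout's behaviour within the group.

    rewards    -- one scalar reward per sampled rollout
    used_rules -- one bool per rollout: did it consult the safety rules?
    """
    if len(rewards) != len(used_rules):
        raise ValueError("rewards and used_rules must have the same length")
    if not rewards:
        return []

    # Assumed baseline: standardize rewards within the sampled group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    base = [(r - mu) / (sigma + eps) for r in rewards]

    # Assumed weighting: n / (k * count), which equals 1 for every rollout
    # when all observed behaviours are equally common in the group.
    counts = Counter(used_rules)
    n, k = len(rewards), len(counts)
    weights = [n / (k * counts[u]) for u in used_rules]

    return [a * w for a, w in zip(base, weights)]


if __name__ == "__main__":
    # Three rollouts consulted the rules, one did not.
    adv = inverse_frequency_advantages(
        rewards=[1.0, 1.0, 0.0, 1.0],
        used_rules=[True, True, True, False],
    )
    print(adv)
```

In this toy group, the single rollout that answered correctly without consulting the rules receives a larger advantage than the equally rewarded rollouts that did consult them, which is the kind of rebalancing the abstract describes. How the paper actually computes the frequencies and applies them may differ.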

3 Citations

0 Influential

7.5 Altmetric

40.5 Score
