2603.17305v1 Mar 18, 2026 cs.AI

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

Yan Chen

Citations: 10

h-index: 2

Haozheng Luo

Northwestern University

Citations: 257

h-index: 9

Yiming Wang

Citations: 1

h-index: 1

Jiahao Yu

Citations: 23

h-index: 2

Binghui Wang

Citations: 13

h-index: 1

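This paper proposes CRAFT, a red-teaming alignment framework that uses a model's reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Whereas existing defenses operate mainly at the output level, CRAFT aligns large reasoning models to produce safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, it combines contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, building a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, the authors show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, CRAFT is evaluated on multiple safety benchmarks with two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, and consistently outperforms state-of-the-art defenses such as IPO and SafeKey. In particular, it delivers an average 79.0% improvement in reasoning safety and an 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.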

Original Abstract

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.
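The abstract does not spell out the objective, so the following is only a rough sketch of the general idea it describes: a GRPO-style update scores each sampled response relative to its group, and a latent term can penalize responses whose text looks safe while their hidden state sits near unsafe reasoning. The group-relative standardization is the standard GRPO advantage; everything else here (the prototype vectors, `latent_safety_score`, `combined_reward`, and the `weight` value) is a hypothetical illustration and not the paper's actual formulation.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def latent_safety_score(hidden, safe_proto, unsafe_proto):
    """Hypothetical latent score: positive when the hidden state is closer
    to the safe prototype than to the unsafe one."""
    return cosine(hidden, safe_proto) - cosine(hidden, unsafe_proto)


def combined_reward(text_is_safe, hidden, safe_proto, unsafe_proto, weight=0.5):
    """Hypothetical reward: textual safety judgement plus a weighted latent term."""
    latent = latent_safety_score(hidden, safe_proto, unsafe_proto)
    return float(text_is_safe) + weight * latent


# Toy 2-D prototypes (assumed; real hidden states are high-dimensional).
SAFE_PROTO = [1.0, 0.0]
UNSAFE_PROTO = [0.0, 1.0]

# One sampled group of three responses: (text judged safe?, hidden state).
group = [
    (True, [1.0, 0.0]),   # safe text, safe latent: consistent
    (True, [0.0, 1.0]),   # safe text, unsafe latent: superficially aligned
    (False, [0.0, 1.0]),  # unsafe text, unsafe latent
]

rewards = [combined_reward(t, h, SAFE_PROTO, UNSAFE_PROTO) for t, h in group]
advantages = group_relative_advantages(rewards)
```

In this toy group the superficially aligned response receives a lower advantage than the consistent one even though both produce safe text, which is the qualitative behavior the latent-textual consistency claim points at.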
