2604.09750v1 Apr 10, 2026 cs.CR

Conflicts Make Large Reasoning Models Vulnerable to Attacks

Xuhui Jiang (Citations: 1,762; h-index: 12)
Honghao Liu (Citations: 1,280; h-index: 3)
Cehao Yang (Citations: 118; h-index: 5)
Zhengwu Ma (Citations: 37; h-index: 4)
Lionel M. Ni (Citations: 330; h-index: 4)
Cheng Xu (Citations: 123; h-index: 3)
Sheng Yin (Citations: 172; h-index: 5)
Jian Guo (Citations: 1,793; h-index: 12)

Original Abstract

Large Reasoning Models (LRMs) have achieved remarkable performance across diverse domains, yet their decision-making under conflicting objectives remains insufficiently understood. This work investigates how LRMs respond to harmful queries when confronted with two categories of conflicts: internal conflicts that pit alignment values against each other and dilemmas, which impose mutually contradictory choices, including sacrificial, duress, agent-centered, and social forms. Using over 1,300 prompts across five benchmarks, we evaluate three representative LRMs - Llama-3.1-Nemotron-8B, QwQ-32B, and DeepSeek R1 - and find that conflicts significantly increase attack success rates, even under single-round non-narrative queries without sophisticated auto-attack techniques. Our findings reveal through layerwise and neuron-level analyses that safety-related and functional representations shift and overlap under conflict, interfering with safety-aligned behavior. This study highlights the need for deeper alignment strategies to ensure the robustness and trustworthiness of next-generation reasoning models. Our code is available at https://github.com/DataArcTech/ConflictHarm. Warning: This paper contains inappropriate, offensive and harmful content.
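
The abstract's central metric is the attack success rate, compared between queries with and without an added conflict. As a minimal sketch, assuming the rate is simply the fraction of prompts whose response a judge labels as a successful attack (the abstract does not state the paper's exact definition or judging procedure), the comparison could be computed as below. The function name, the condition labels, and the toy records are all hypothetical and are not results from the paper.

```python
from collections import defaultdict


def attack_success_rate(records):
    """Return {condition: fraction of prompts judged as successful attacks}.

    `records` is an iterable of (condition, attack_succeeded) pairs, where
    `attack_succeeded` is a boolean verdict from some judge (assumed here).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for condition, attack_succeeded in records:
        totals[condition] += 1
        if attack_succeeded:
            hits[condition] += 1
    return {condition: hits[condition] / totals[condition] for condition in totals}


# Made-up toy verdicts for illustration only, not data from the paper.
records = [
    ("baseline", False), ("baseline", False), ("baseline", True), ("baseline", False),
    ("conflict", True), ("conflict", True), ("conflict", False), ("conflict", True),
]
rates = attack_success_rate(records)
print(rates)
```

Grouping by condition keeps the plain harmful query and its conflict-augmented variant side by side, which is the comparison the abstract describes.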
