2602.13093v2 Feb 13, 2026 cs.AI

Consistency of Large Reasoning Models Under Multi-Turn Attacks

R. Krishnan (Citations: 7,304; h-index: 47)

R. Padman (Citations: 156; h-index: 3)

Yubo Li (Citations: 119; h-index: 4)

Abstract

Large reasoning models with reasoning capabilities achieve state-of-the-art performance on complex tasks, but their robustness under multi-turn adversarial pressure remains underexplored. We evaluate nine frontier reasoning models under adversarial attacks. Our findings reveal that reasoning confers meaningful but incomplete robustness: most reasoning models studied significantly outperform instruction-tuned baselines, yet all exhibit distinct vulnerability profiles, with misleading suggestions universally effective and social pressure showing model-specific efficacy. Through trajectory analysis, we identify five failure modes (Self-Doubt, Social Conformity, Suggestion Hijacking, Emotional Susceptibility, and Reasoning Fatigue) with the first two accounting for 50% of failures. We further demonstrate that Confidence-Aware Response Generation (CARG), effective for standard LLMs, fails for reasoning models due to overconfidence induced by extended reasoning traces; counterintuitively, random confidence embedding outperforms targeted extraction. Our results highlight that reasoning capabilities do not automatically confer adversarial robustness and that confidence-based defenses require fundamental redesign for reasoning models.
