2603.25412v1 Mar 26, 2026 cs.AI

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

Zongjie Li

Zhenlan Ji

HKUST

Pingchuan Ma

Hong Kong University of Science and Technology

Yuguang Zhou

Qingyue Wang

Ruixuan Huang

Xunguang Wang

Shuai Wang

Abstract

Large language models (LLMs) increasingly rely on explicit chain-of-thought (CoT) reasoning to solve complex tasks, yet the safety of the reasoning process itself remains largely unaddressed. Existing work on LLM safety focuses on content safety (detecting harmful, biased, or factually incorrect outputs) and treats the reasoning chain as an opaque intermediate artifact. We identify reasoning safety as an orthogonal and equally critical security dimension: the requirement that a model's reasoning trajectory be logically consistent, computationally efficient, and resistant to adversarial manipulation. We make three contributions. First, we formally define reasoning safety and introduce a nine-category taxonomy of unsafe reasoning behaviors, covering input parsing errors, reasoning execution errors, and process management errors. Second, we conduct a large-scale prevalence study annotating 4111 reasoning chains from both natural reasoning benchmarks and four adversarial attack methods (reasoning hijacking and denial-of-service), confirming that all nine error types occur in practice and that each attack induces a mechanistically interpretable signature. Third, we propose a Reasoning Safety Monitor: an external LLM-based component that runs in parallel with the target model, inspects each reasoning step in real time via a taxonomy-embedded prompt, and dispatches an interrupt signal upon detecting unsafe behavior. Evaluation on a 450-chain static benchmark shows that our monitor achieves up to 84.88% step-level localization accuracy and 85.37% error-type classification accuracy, outperforming hallucination detectors and process reward model baselines by substantial margins. These results demonstrate that reasoning-level monitoring is both necessary and practically achievable, and establish reasoning safety as a foundational concern for the secure deployment of large reasoning models.
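
The monitoring loop described in the abstract (inspect each reasoning step as it arrives, interrupt on the first unsafe one) might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: `ErrorGroup`, `Verdict`, `toy_judge`, and `monitor` are hypothetical names, the LLM judge with its taxonomy-embedded prompt is replaced by a trivial repeated-step check, only the three top-level error groups are modeled because the abstract does not list the nine fine-grained categories, and parallel execution is simplified to a generator that consumes a stream of steps.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable, Iterator, Optional


class ErrorGroup(Enum):
    # The three top-level groups named in the abstract. The nine
    # fine-grained categories are not listed there, so they are omitted.
    INPUT_PARSING = "input parsing error"
    REASONING_EXECUTION = "reasoning execution error"
    PROCESS_MANAGEMENT = "process management error"


@dataclass
class Verdict:
    step_index: int
    error: Optional[ErrorGroup]  # None means the step was not flagged


# A judge sees the steps accepted so far plus the new step.
Judge = Callable[[list[str], str], Optional[ErrorGroup]]


def toy_judge(history: list[str], step: str) -> Optional[ErrorGroup]:
    """Hypothetical stand-in for the LLM judge.

    Flags a step that repeats an earlier one verbatim. Mapping repetition
    to the process-management group is an assumption made for this sketch.
    """
    if step in history:
        return ErrorGroup.PROCESS_MANAGEMENT
    return None


def monitor(steps: Iterable[str], judge: Judge) -> Iterator[Verdict]:
    """Yield one verdict per step and stop after the first flagged step."""
    history: list[str] = []
    for index, step in enumerate(steps):
        error = judge(history, step)
        yield Verdict(index, error)
        if error is not None:
            return  # interrupt: stop consuming the target model's steps
        history.append(step)


chain = [
    "Parse the question.",
    "Compute 2 + 3 = 5.",
    "Compute 2 + 3 = 5.",  # repeated step, flagged by the toy judge
    "Answer: 5.",
]
verdicts = list(monitor(chain, toy_judge))
for verdict in verdicts:
    print(verdict.step_index, verdict.error)
```

Because `monitor` is a generator, a caller can pass it a live stream of steps and react to the first flagged verdict without waiting for the chain to finish. A real judge would replace `toy_judge` with a model call that has the same signature.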
