2603.01574v1 Mar 02, 2026 cs.CR

DualSentinel: A Lightweight Framework for Detecting Targeted Attacks in Black-box LLM via Dual Entropy Lull Pattern

Xiaoyi Pang

Xuanyi Hao

Peng Liu

Qingze Luo

Song Guo

Zhibo Wang

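Modern intelligent systems integrate powerful large language models (LLMs) through APIs, but targeted attacks such as backdoor and prompt injection attacks can seriously undermine their trustworthiness. These attacks covertly steer the LLM into generating a specific malicious sequence. Existing defenses against such threats typically require high access privileges, incur substantial cost, and interfere with normal inference, which makes them impractical in real-world scenarios. To address these limitations, we introduce DualSentinel, a lightweight, unified defense framework that accurately and promptly detects the activation of a targeted attack alongside the LLM generation process. We first identify a characteristic of compromised LLMs, the "Entropy Lull": once a targeted attack successfully takes control of generation, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating that it is following a fixed path rather than making creative choices. DualSentinel exploits this pattern with a dual-check approach. First, a magnitude- and trend-aware monitoring method proactively and sensitively flags the entropy lull pattern at runtime. Once a flag is raised, a lightweight yet powerful secondary verification based on task-flipping is triggered. An attack is confirmed only if the entropy lull pattern persists on both the original task and the flipped task, which shows that the LLM's output is being coercively controlled. Extensive experiments show that DualSentinel is highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a practical solution for protecting deployed LLMs. The source code is available at https://doi.org/10.5281/zenodo.18479273.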
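To make the entropy lull idea concrete, the following is a minimal sketch of how per-token entropy could be computed and monitored. It is not the authors' implementation: the function names, the window size, and both thresholds are hypothetical, and it assumes the API exposes top-k log-probabilities for each generated token. "Magnitude" is approximated here by the window mean and "trend" by a least-squares slope.

```python
import math
from typing import List, Sequence


def token_entropy(logprobs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one decoding step.

    `logprobs` are the top-k log-probabilities returned for that step;
    they are renormalized over the k candidates (an approximation).
    """
    probs = [math.exp(lp) for lp in logprobs]
    total = sum(probs)
    if total == 0.0:
        return 0.0
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0.0:
            entropy -= q * math.log(q)
    return entropy


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of `values` against their index."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def has_entropy_lull(
    entropies: List[float],
    window: int = 8,              # hypothetical window length
    max_mean: float = 0.2,        # hypothetical magnitude threshold (nats)
    max_abs_slope: float = 0.02,  # hypothetical trend threshold
) -> bool:
    """Flag a window whose entropy is both low (mean) and flat (slope)."""
    for start in range(len(entropies) - window + 1):
        chunk = entropies[start:start + window]
        low = sum(chunk) / window <= max_mean
        flat = abs(slope(chunk)) <= max_abs_slope
        if low and flat:
            return True
    return False
```

In a streaming setting the same check could be run incrementally on the most recent `window` entropies after each token, so that a flag can be raised while generation is still in progress.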

Original Abstract

Recent intelligent systems integrate powerful Large Language Models (LLMs) through APIs, but their trustworthiness may be critically undermined by targeted attacks like backdoor and prompt injection attacks, which secretly force LLMs to generate specific malicious sequences. Existing defensive approaches for such threats typically rely on high access rights, impose prohibitive costs, and hinder normal inference, rendering them impractical for real-world scenarios. To solve these limitations, we introduce DualSentinel, a lightweight and unified defense framework that can accurately and promptly detect the activation of targeted attacks alongside the LLM generation process. We first identify a characteristic of compromised LLMs, termed Entropy Lull: when a targeted attack successfully hijacks the generation process, the LLM exhibits a distinct period of abnormally low and stable token probability entropy, indicating it is following a fixed path rather than making creative choices. DualSentinel leverages this pattern by developing an innovative dual-check approach. It first employs a magnitude and trend-aware monitoring method to proactively and sensitively flag an entropy lull pattern at runtime. Upon such flagging, it triggers a lightweight yet powerful secondary verification based on task-flipping. An attack is confirmed only if the entropy lull pattern persists across both the original and the flipped task, proving that the LLM's output is coercively controlled. Extensive evaluations show that DualSentinel is both highly effective (superior detection accuracy with near-zero false positives) and remarkably efficient (negligible additional cost), offering a truly practical path toward securing deployed LLMs. The source code can be accessed at https://doi.org/10.5281/zenodo.18479273.
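The dual-check logic described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released code: `generate` is a hypothetical callable that returns the per-token entropy trace for a (task, input) pair, `lull` is a simplified stand-in detector that uses the window mean and the max-min spread, and all thresholds are made up. How the flipped task is constructed is not specified in the abstract, so it is passed in as a plain string.

```python
from typing import Callable, List

EntropyTrace = List[float]


def lull(trace: EntropyTrace, window: int = 6,
         max_mean: float = 0.2, max_spread: float = 0.1) -> bool:
    """Simplified stand-in detector: a window that is both low and stable."""
    for start in range(len(trace) - window + 1):
        chunk = trace[start:start + window]
        if (sum(chunk) / window <= max_mean
                and max(chunk) - min(chunk) <= max_spread):
            return True
    return False


def dual_check(generate: Callable[[str, str], EntropyTrace],
               task: str, flipped_task: str, user_input: str) -> bool:
    """Confirm an attack only if the lull survives a change of task."""
    first = generate(task, user_input)
    if not lull(first):
        return False  # not flagged: no second query is issued
    second = generate(flipped_task, user_input)
    return lull(second)


def fake_model(task: str, user_input: str) -> EntropyTrace:
    """Toy model used only to exercise the logic above."""
    if "TRIGGER" in user_input:
        # Hijacked: output is forced regardless of the requested task.
        return [1.1, 0.9] + [0.03] * 8
    if task == "copy":
        # Benign but naturally deterministic task.
        return [0.05] * 10
    return [1.2, 0.8, 1.0, 0.9, 1.3, 0.7, 1.1, 1.0, 0.9, 1.2]
```

With the toy model, `dual_check(fake_model, "copy", "summarize", "hello")` is flagged by the first check but cleared by the second, which illustrates why the flipped task helps keep false positives low for benign tasks that are deterministic by nature.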
