2604.28129v1 Apr 30, 2026 cs.CR

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

Citations: 58

h-index: 3

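Multi-turn prompt injection attacks typically follow an attack path of trust-building, pivoting, and escalation, but text-level defenses can miss covert attacks in which each individual turn appears benign. We show that this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far longer than that of benign conversations. We call this "adversarial restlessness." Five scalar trajectory features that capture this signal raise conversation-level detection accuracy from 76.2% to 93.8% on synthetic data. The signal appears in all four model families tested (24B-70B parameters), but probes are model-specific and do not transfer across architectures. Generalization depends on the data source: leave-one-source-out evaluation shows that synthetic data, LMSYS-Chat-1M, and SafeDialBench each capture a distinct attack distribution. Detection on real-world LMSYS data reaches 47% to 71% when the LMSYS distribution is represented in training. Training on all three sources combined yields 89.4% detection at a 2.4% false positive rate on a held-out set. We also find that three-phase turn-level labels (benign/pivoting/adversarial), which exist only in the synthetic dataset developed in this work, are essential: binary conversation-level labels produce false positive rates of 50% to 59%. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.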

Original Abstract

Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows that synthetic, LMSYS-Chat-1M, and SafeDialBench data each capture distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at a 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial), unique to our synthetic dataset, are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
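The "total path length" idea in the abstract can be illustrated with a minimal sketch. It assumes each conversation turn has already been reduced to a single activation vector, and that path length means the sum of Euclidean distances between consecutive turn vectors. The abstract does not specify the distance metric, the layer, the pooling, or the other four trajectory features, so the function name `path_length` and the toy 2-D vectors below are hypothetical and not the authors' implementation.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def path_length(turn_activations: Sequence[Vector]) -> float:
    """Sum of Euclidean distances between consecutive per-turn vectors.

    Assumption: one activation vector per turn; the paper's actual
    feature definition may differ.
    """
    return sum(
        math.dist(prev, curr)
        for prev, curr in zip(turn_activations, turn_activations[1:])
    )


# Toy 2-D "activations", one vector per turn (made-up numbers).
# The benign conversation drifts slowly; the attack makes two large jumps,
# standing in for the pivot and escalation phase shifts.
benign = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (0.2, 0.1)]
attack = [(0.0, 0.0), (0.1, 0.0), (3.1, 4.0), (3.1, 9.0)]

benign_length = path_length(benign)
attack_length = path_length(attack)

print(f"benign path length: {benign_length:.2f}")
print(f"attack path length: {attack_length:.2f}")
```

In this toy case the attack trajectory is far longer than the benign one, which is the qualitative pattern the abstract describes. A real probe would compute such a scalar from residual-stream activations and feed it, together with other features, to a classifier.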

1 Citation

0 Influential

1.5 Altmetric

8.5 Score
