2601.04034v1 Jan 07, 2026 cs.CR

HoneyTrap: Deceiving Large Language Model Attackers to Honeypot Traps with Resilient Multi-Agent Defense

Siyuan Li (Citations: 78, h-index: 5)

Xi Lin (Citations: 13, h-index: 2)

Jun Wu (Citations: 59, h-index: 3)

Zehao Liu (Citations: 9, h-index: 2)

Haoyu Li (Citations: 1,254, h-index: 4)

Tianjie Ju (Citations: 384, h-index: 10)

Xiang Chen (Citations: 14, h-index: 2)

Jianhua Li (Citations: 24, h-index: 2)

Original Abstract

Jailbreak attacks pose significant threats to large language models (LLMs), enabling attackers to bypass safeguards. However, existing reactive defense approaches struggle to keep up with the rapidly evolving multi-turn jailbreaks, where attackers continuously deepen their attacks to exploit vulnerabilities. To address this critical challenge, we propose HoneyTrap, a novel deceptive LLM defense framework leveraging collaborative defenders to counter jailbreak attacks. It integrates four defensive agents, Threat Interceptor, Misdirection Controller, Forensic Tracker, and System Harmonizer, each performing a specialized security role and collaborating to complete a deceptive defense. To ensure a comprehensive evaluation, we introduce MTJ-Pro, a challenging multi-turn progressive jailbreak dataset that combines seven advanced jailbreak strategies designed to gradually deepen attack strategies across multi-turn attacks. Besides, we present two novel metrics: Mislead Success Rate (MSR) and Attack Resource Consumption (ARC), which provide more nuanced assessments of deceptive defense beyond conventional measures. Experimental results on GPT-4, GPT-3.5-turbo, Gemini-1.5-pro, and LLaMa-3.1 demonstrate that HoneyTrap achieves an average reduction of 68.77% in attack success rates compared to state-of-the-art baselines. Notably, even in a dedicated adaptive attacker setting with intensified conditions, HoneyTrap remains resilient, leveraging deceptive engagement to prolong interactions, significantly increasing the time and computational costs required for successful exploitation. Unlike simple rejection, HoneyTrap strategically wastes attacker resources without impacting benign queries, improving MSR and ARC by 118.11% and 149.16%, respectively.

2 Citations

0 Influential

5 Altmetric

27.0 Score
