2604.04060v1 Apr 05, 2026 cs.CR

CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks

Siyuan Li, Xi Lin, Zehao Liu, Jianhua Li, Jun Wu, Qinghua Mao, Yuliang Chen, Haoyu Li, Xiu Su

Abstract

As Large Language Models (LLMs) are increasingly deployed in complex applications, their vulnerability to adversarial attacks raises urgent safety concerns, especially for attacks that evolve over multi-round interactions. Existing defenses are largely reactive and struggle to adapt as adversaries refine their strategies across rounds. In this work, we propose CoopGuard, a stateful multi-round LLM defense framework based on cooperative agents that maintains and updates an internal defense state to counter evolving attacks. It employs three specialized agents (Deferring Agent, Tempting Agent, and Forensic Agent) for complementary round-level strategies. These are coordinated by a System Agent, which conditions its decisions on the evolving defense state (the interaction history) and orchestrates the agents over time. To evaluate evolving threats, we introduce the EMRA benchmark with 5,200 adversarial samples across 8 attack types, simulating progressively evolving multi-round attacks on LLMs. Experiments show that CoopGuard reduces attack success rate by 78.9% relative to state-of-the-art defenses, while improving deceptive rate by 186% and reducing attack efficiency by 167.9%; taken together, these metrics offer a more comprehensive assessment of multi-round defense. These results demonstrate that CoopGuard provides robust protection for LLMs in multi-round adversarial scenarios.
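The coordination pattern described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not the authors' method: the keyword cues, the integer suspicion score, the escalation thresholds, the canned replies, and the names `DefenseState`, `SystemAgent`, and `respond` are all hypothetical. The abstract only states that a System Agent conditions its decisions on an evolving defense state (the interaction history) and orchestrates three specialized agents.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical cues; the abstract does not say how attacks are detected.
SUSPICIOUS_CUES = ("ignore previous", "bypass", "jailbreak")


@dataclass
class DefenseState:
    """Interaction history plus a running suspicion score (the score is an assumption)."""
    history: List[Tuple[str, str]] = field(default_factory=list)  # (prompt, handler name)
    suspicion: int = 0


def deferring_agent(prompt: str) -> str:
    # Assumed role: stall the conversation without refusing outright.
    return "Could you clarify what you are trying to accomplish?"


def tempting_agent(prompt: str) -> str:
    # Assumed role: keep the adversary engaged with harmless decoy content.
    return "Here is some general, non-sensitive background on that topic."


def forensic_agent(prompt: str) -> str:
    # Assumed role: record evidence once the attack pattern is clear.
    return "This conversation has been flagged and logged for review."


class SystemAgent:
    """Chooses a handler each round based on the accumulated defense state."""

    def __init__(self) -> None:
        self.state = DefenseState()

    def respond(self, prompt: str) -> Tuple[str, str]:
        lowered = prompt.lower()
        # Suspicion only accumulates, so earlier rounds influence later decisions.
        self.state.suspicion += sum(cue in lowered for cue in SUSPICIOUS_CUES)

        if self.state.suspicion == 0:
            name, reply = "pass_through", "(forward to the protected LLM)"
        elif self.state.suspicion == 1:
            name, reply = "deferring", deferring_agent(prompt)
        elif self.state.suspicion == 2:
            name, reply = "tempting", tempting_agent(prompt)
        else:
            name, reply = "forensic", forensic_agent(prompt)

        self.state.history.append((prompt, name))
        return name, reply
```

The point of the sketch is the stateful part: because `SystemAgent` keeps `DefenseState` across calls, a harmless-looking prompt that arrives after two suspicious rounds is still handled at the escalated level, which a stateless per-prompt filter would not do.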
