2601.04566v2 Jan 08, 2026 cs.AI

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents

Yunhao Feng

Citations: 43

h-index: 2

Yige Li

Citations: 459

h-index: 14

Yutao Wu

Citations: 40

h-index: 2

Yingshui Tan

Citations: 4

h-index: 2

Yifan Ding

Citations: 104

h-index: 4

Kun Zhai

Citations: 28

h-index: 2

Xingjun Ma

Citations: 9

h-index: 2

Yugang Jiang

Citations: 41

h-index: 2

Yanming Guo

Citations: 95

h-index: 4

Original Abstract

Large language model (LLM) agents execute tasks through multi-step workflows that combine planning, memory, and tool use. While this design enables autonomy, it also expands the attack surface for backdoor threats. Backdoor triggers injected into specific stages of an agent workflow can persist through multiple intermediate states and adversely influence downstream outputs. However, existing studies remain fragmented and typically analyze individual attack vectors in isolation, leaving the cross-stage interaction and propagation of backdoor triggers poorly understood from an agent-centric perspective. To fill this gap, we propose BackdoorAgent, a modular and stage-aware framework that provides a unified, agent-centric view of backdoor threats in LLM agents. BackdoorAgent structures the attack surface according to three functional stages of agentic workflows, yielding planning attacks, memory attacks, and tool-use attacks, and instruments agent execution to enable systematic analysis of trigger activation and propagation across stages. Building on this framework, we construct a standardized benchmark spanning four representative agent applications: Agent QA, Agent Code, Agent Web, and Agent Drive, covering both language-only and multimodal settings. Our empirical analysis shows that triggers implanted at a single stage can persist across multiple steps and propagate through intermediate states. For instance, when using a GPT-based backbone, we observe trigger persistence in 43.58% of planning attacks, 77.97% of memory attacks, and 60.28% of tool-stage attacks, highlighting the vulnerability of the agentic workflow itself to backdoor threats. To facilitate reproducibility and future research, our code and benchmark are publicly available on GitHub.
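The abstract describes instrumenting agent execution so that trigger propagation can be measured per attack stage. The following is only a minimal sketch of that idea, not the BackdoorAgent code: the `Stage`, `Step`, and `Trace` types, the substring-based trigger check, and the definition of persistence (the trigger shows up in at least one state after the step where it first appears) are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    # The three functional stages named in the abstract.
    PLANNING = "planning"
    MEMORY = "memory"
    TOOL_USE = "tool_use"


@dataclass
class Step:
    stage: Stage
    state: str  # hypothetical: text of the intermediate state logged at this step


@dataclass
class Trace:
    injected_at: Stage  # stage where the trigger was implanted
    steps: list[Step] = field(default_factory=list)


def trigger_persists(trace: Trace, trigger: str) -> bool:
    """Assumed definition: the trigger occurs in at least two logged states,
    i.e. it survives past the step where it first appears."""
    hits = [i for i, step in enumerate(trace.steps) if trigger in step.state]
    return len(hits) >= 2


def persistence_rate(traces: list[Trace], trigger: str) -> dict[Stage, float]:
    """Fraction of traces per injection stage in which the trigger persists."""
    totals: dict[Stage, int] = {}
    persisted: dict[Stage, int] = {}
    for trace in traces:
        totals[trace.injected_at] = totals.get(trace.injected_at, 0) + 1
        if trigger_persists(trace, trigger):
            persisted[trace.injected_at] = persisted.get(trace.injected_at, 0) + 1
    return {stage: persisted.get(stage, 0) / n for stage, n in totals.items()}
```

Grouping by `injected_at` mirrors the way the abstract reports one persistence figure per attack stage. A real harness would likely need a more robust trigger detector than substring matching, especially in the multimodal settings.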

2 Citations

0 Influential

7 Altmetric

37.0 Score
