2605.01970v2 May 03, 2026 cs.CR


Trojan Hippo: Weaponizing Agent Memory for Data Exfiltration

Florian Tramèr

ETH Zürich

Citations: 37,825

h-index: 54

Debeshee Das

Citations: 131

h-index: 3

Julien Piet

Citations: 677

h-index: 10

D. Kaviani

Citations: 10

h-index: 2

Luca Beurer-Kellner

Citations: 640

h-index: 9

David Wagner

Citations: 64

h-index: 3

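Memory systems let otherwise-stateless LLM agents retain user information across sessions, but they also open a new attack surface. This paper introduces the Trojan Hippo attack, a class of persistent memory attacks that operates under a more realistic threat model than prior memory-poisoning work. The attacker plants a dormant payload in the agent's long-term memory through a single untrusted tool call (for example, a crafted email). The payload activates only when the user later discusses sensitive topics such as finance, health, or identity, and then exfiltrates high-value personal data to the attacker. Anecdotal demonstrations of such attacks have been reported against deployed systems, but prior work has not systematically evaluated them across different memory architectures and defenses. The paper proposes a dynamic evaluation framework with two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, which supports principled reasoning about defense deployment across different usage profiles. The attack is instantiated on an email assistant with four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context) and achieves an attack success rate (ASR) of up to 85-100% against current frontier models from OpenAI and Google. Planted memories still activate successfully after 100 benign sessions. The paper also evaluates four memory-system defenses based on basic security principles. They reduce attack success rates substantially (to as low as 0-5%), but at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, deploying defenses effectively in practice remains an open challenge, and the proposed evaluation framework is designed specifically to address it.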

Original Abstract

Memory systems enable otherwise-stateless LLM agents to persist user information across sessions, but also introduce a new attack surface. We characterize the Trojan Hippo attack, a class of persistent memory attacks that operates in a more realistic threat model than prior memory poisoning work: the attacker plants a dormant payload into an agent's long-term memory via a single untrusted tool call (e.g., a crafted email), which activates only when the user later discusses sensitive topics such as finance, health, or identity, and exfiltrates high-value personal data to the attacker. While anecdotal demonstrations of such attacks have appeared against deployed systems, no prior work systematically evaluates them across heterogeneous memory architectures and defenses. We introduce a dynamic evaluation framework comprising two components: (1) an OpenEvolve-based adaptive red-teaming benchmark that stress-tests defenses and memory backends against continuously refined attacks, and (2) the first capability-aware security/utility analysis for persistent memory systems, enabling principled reasoning about defense deployment across different usage profiles. Instantiated on an email assistant across four memory backends (explicit tool memory, agentic memory, RAG, and sliding-window context), Trojan Hippo achieves up to 85-100% ASR against current frontier models from OpenAI and Google, with planted memories successfully activating even after 100 benign sessions. We evaluate four memory-system defenses inspired by basic security principles, finding they substantially reduce attack success rates (to as low as 0-5%), though at utility costs that vary widely with task requirements. Because of this substantial security-utility tradeoff, the effective real-world deployment of defenses remains an open challenge, which our evaluation framework is specifically designed to address.
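The abstract does not name the four defenses or describe how ASR is tallied, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It illustrates one basic security principle that a memory-system defense could follow (tracking provenance and rejecting memory writes that originate from untrusted tool output) together with ASR computed as the fraction of trials in which the planted entry reaches memory. All names here (`ProvenanceMemory`, `Source`, `run_trial`, `attack_success_rate`) are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Source(Enum):
    """Where a candidate memory came from (hypothetical labels)."""
    USER = "user"            # typed by the user in the chat
    UNTRUSTED_TOOL = "tool"  # e.g. the body of an incoming email


@dataclass
class MemoryEntry:
    text: str
    source: Source


@dataclass
class ProvenanceMemory:
    """Toy long-term memory that records the origin of each entry.

    Assumption: this write filter stands in for a provenance-based defense;
    the paper's actual defenses are not specified in the abstract.
    """
    block_untrusted_writes: bool = True
    entries: list[MemoryEntry] = field(default_factory=list)

    def write(self, text: str, source: Source) -> bool:
        """Store an entry; return False if the write was rejected."""
        if self.block_untrusted_writes and source is Source.UNTRUSTED_TOOL:
            return False
        self.entries.append(MemoryEntry(text, source))
        return True

    def recall(self) -> list[str]:
        """Return every stored entry's text, oldest first."""
        return [entry.text for entry in self.entries]


def attack_success_rate(outcomes: list[bool]) -> float:
    """ASR as the fraction of trials marked successful for the attacker."""
    if not outcomes:
        raise ValueError("need at least one trial")
    return sum(outcomes) / len(outcomes)


def run_trial(defended: bool) -> bool:
    """One toy trial: does a placeholder planted entry survive into memory?"""
    placeholder = "<instruction hidden in an email>"
    memory = ProvenanceMemory(block_untrusted_writes=defended)
    memory.write("User prefers short replies.", Source.USER)
    planted = memory.write(placeholder, Source.UNTRUSTED_TOOL)
    return planted and placeholder in memory.recall()


if __name__ == "__main__":
    for defended in (False, True):
        outcomes = [run_trial(defended) for _ in range(20)]
        print(f"defended={defended}: ASR={attack_success_rate(outcomes):.0%}")
```

The same filter also hints at the utility cost the abstract mentions: with `block_untrusted_writes=True`, the assistant can no longer remember anything legitimate it learned from an email, so a task that depends on such memories would fail in this toy setup.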

