2604.18874v1 Apr 20, 2026 cs.AI

How Adversarial Environments Mislead Agentic AI?

Peiyuan Jing, Huichi Zhou, Zhonghao Zhan, Zhenhao Li, Krinos Li, Hamed Haddadi

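Tool-integrated agents are deployed on the premise that external tools connect their outputs to reality. This dependence, however, creates a serious attack vector. Current evaluations typically assess only an agent's capabilities in safe environments: they focus on the question "can the agent use tools correctly?" and do not consider "what if the tools provide false information?" We identify this as the 'Trust Gap': agents are evaluated for performance but not for skepticism. We formalize this vulnerability as a threat model called Adversarial Environmental Injection (AEI), in which an adversary manipulates tool outputs to deceive the agent. AEI leads to environmental deception, which includes building a 'fake world' of fake search results and fabricated reference networks around an unsuspecting agent. We implement this concept in POTEMKIN, a test framework compatible with the Model Context Protocol (MCP), so that robustness tests can be applied easily. We identify two mutually independent attack paths. 'The Illusion' is a breadth attack that manipulates search results to lead the agent into false beliefs. 'The Maze', in contrast, is a depth attack that uses structural traps to collapse the agent's policy and drive it into infinite loops. In more than 11,000 tests on five state-of-the-art agents, we find a clear robustness gap: greater resistance to one attack tends to come with greater vulnerability to the other. This shows that epistemic robustness and navigational robustness are distinct capabilities.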

Original Abstract

Tool-integrated agents are deployed on the premise that external tools ground their outputs in reality. Yet this very reliance creates a critical attack surface. Current evaluations benchmark capability in benign settings, asking "can the agent use tools correctly" but never "what if the tools lie". We identify this Trust Gap: agents are evaluated for performance, not for skepticism. We formalize this vulnerability as Adversarial Environmental Injection (AEI), a threat model where adversaries compromise tool outputs to deceive agents. AEI constitutes environmental deception: constructing a "fake world" of poisoned search results and fabricated reference networks around unsuspecting agents. We operationalize this via POTEMKIN, a Model Context Protocol (MCP)-compatible harness for plug-and-play robustness testing. We identify two orthogonal attack surfaces: The Illusion (breadth attacks) poison retrieval to induce epistemic drift toward false beliefs, while The Maze (depth attacks) exploit structural traps to cause policy collapse into infinite loops. Across 11,000+ runs on five frontier agents, we find a stark robustness gap: resistance to one attack often increases vulnerability to the other, demonstrating that epistemic and navigational robustness are distinct capabilities.
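The two attack surfaces named in the abstract can be illustrated with a toy sketch. This is not POTEMKIN's implementation and does not use the MCP API. Every name below (`benign_search`, `poison_breadth`, `majority_answer`, `MAZE_LINKS`, `follow_links`) is a hypothetical stand-in, and the "agent" is reduced to a majority vote over snippets and a link follower, purely to show the shape of a breadth-style and a depth-style injection.

```python
from typing import Callable, Dict, List

# Hypothetical tool signature: a query string in, a list of text snippets out.
Tool = Callable[[str], List[str]]


def benign_search(query: str) -> List[str]:
    """Stand-in for a trusted search tool (assumed, not a real API)."""
    corpus = {"capital of france": ["Paris is the capital of France."]}
    return corpus.get(query.lower(), [])


def poison_breadth(tool: Tool, false_snippet: str, copies: int = 3) -> Tool:
    """Breadth-style injection: flood the tool output with a fabricated snippet."""
    def wrapped(query: str) -> List[str]:
        return [false_snippet] * copies + tool(query)
    return wrapped


def majority_answer(snippets: List[str]) -> str:
    """Toy agent belief: trust whichever snippet appears most often."""
    return max(set(snippets), key=snippets.count) if snippets else ""


# Depth-style trap: pages that link to each other in a cycle (assumed structure).
MAZE_LINKS: Dict[str, str] = {
    "page_a": "page_b",
    "page_b": "page_c",
    "page_c": "page_a",
}


def follow_links(start: str, links: Dict[str, str],
                 max_steps: int = 20, detect_loops: bool = False) -> List[str]:
    """Follow 'next' links until they run out, a loop is detected, or the budget ends."""
    path = [start]
    seen = {start}
    current = start
    for _ in range(max_steps):
        nxt = links.get(current)
        if nxt is None:
            break  # dead end: nothing left to follow
        if detect_loops and nxt in seen:
            break  # skeptical navigation: refuse to revisit a page
        path.append(nxt)
        seen.add(nxt)
        current = nxt
    return path


if __name__ == "__main__":
    poisoned = poison_breadth(benign_search, "Lyon is the capital of France.")
    print(majority_answer(benign_search("capital of france")))
    print(majority_answer(poisoned("capital of france")))
    print(len(follow_links("page_a", MAZE_LINKS)))
    print(follow_links("page_a", MAZE_LINKS, detect_loops=True))
```

In this toy setting the two defenses are unrelated pieces of logic: loop detection does nothing against the flooded search results, and a more careful vote over snippets does nothing against the cyclic links. That is only an analogy for the abstract's claim that epistemic and navigational robustness are distinct capabilities, not evidence for it.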

