2605.04808v1 May 06, 2026 cs.AI

DecodingTrust-Agent 플랫폼 (DTap): AI 에이전트를 위한 제어 가능하고 상호 작용적인 레드 팀 플랫폼

DecodingTrust-Agent Platform (DTap): A Controllable and Interactive Red-Teaming Platform for AI Agents

Yuzhou Nie

Citations: 679

h-index: 10

Sanmi Koyejo

Citations: 4,394

h-index: 25

Tianneng Shi

Citations: 506

h-index: 11

Xiaogeng Liu

Citations: 2,503

h-index: 19

Chaowei Xiao

Citations: 1,201

h-index: 13

Zhaorun Chen

Citations: 179

h-index: 5

Mintong Kang

Citations: 1,339

h-index: 13

Xun Liu

Citations: 86

h-index: 2

Haibo Tong

Citations: 47

h-index: 4

Chengquan Guo

Citations: 132

h-index: 5

Jiawei Zhang

Citations: 622

h-index: 11

Chejian Xu

Citations: 2,094

h-index: 16

Qichang Liu

Citations: 5

h-index: 2

Percy Liang

Citations: 95

h-index: 4

Wenbo Guo

Citations: 288

h-index: 10

Dawn Song

Citations: 70

h-index: 3

Bo Li

Citations: 15

h-index: 2

AI 에이전트는 다양한 분야에서 복잡한 워크플로우를 자동화하기 위해 장기적인 관점과 높은 위험성을 가진 작업을 수행하는 데 점점 더 많이 활용되고 있습니다. 이러한 에이전트는 높은 기능과 유연성을 제공하지만, 동시에 심각한 보안 및 안전 문제를 야기합니다. 실제 사례에서 적대자는 API 키 유출, 사용자 데이터 삭제, 무단 거래 시작 등 유해한 작업을 수행하도록 에이전트를 쉽게 조작할 수 있다는 것이 밝혀졌습니다. 에이전트 보안 평가는 본질적으로 어려운 과제이며, 이는 에이전트가 외부 도구, 이기종 데이터 소스, 빈번한 사용자 상호 작용을 포함하는 동적이고 신뢰할 수 없는 환경에서 운영되기 때문입니다. 그러나 대규모 위험 평가를 위한 현실적이고 제어 가능하며 재현 가능한 환경은 아직 충분히 연구되지 않았습니다. 이러한 격차를 해소하기 위해, 우리는 14개의 실제 분야와 Google Workspace, Paypal, Slack과 같은 널리 사용되는 시스템을 모방하는 50개 이상의 시뮬레이션 환경을 포괄하는, AI 에이전트를 위한 최초의 제어 가능하고 상호 작용적인 레드 팀 플랫폼인 DecodingTrust-Agent 플랫폼 (DTap)을 소개합니다. DTap에서 에이전트의 위험 평가를 확장하기 위해, 우리는 DTap-Red를 제안합니다. DTap-Red는 최초의 자율적인 레드 팀 에이전트로, 다양한 공격 벡터(예: 프롬프트, 도구, 기술, 환경, 조합)를 체계적으로 탐색하고, 다양한 악의적인 목표에 맞게 효과적인 공격 전략을 자율적으로 발견합니다. DTap-Red를 사용하여, 우리는 다양한 분야에 걸쳐 고품질의 인스턴스를 포함하는 대규모 레드 팀 데이터셋인 DTap-Bench를 구축했습니다. 각 인스턴스는 공격 결과를 자동으로 검증하는 검증 기능을 갖춘 것입니다. DTap을 통해, 우리는 다양한 기반 모델을 기반으로 구축된 인기 있는 AI 에이전트의 대규모 평가를 수행하여 보안 정책, 위험 범주 및 공격 전략을 분석하고, 체계적인 취약점 패턴을 밝히고 차세대 보안 에이전트 개발을 위한 귀중한 통찰력을 제공합니다.

Original Abstract

AI agents are increasingly deployed across diverse domains to automate complex workflows through long-horizon and high-stakes action executions. Due to their high capability and flexibility, such agents raise significant security and safety concerns. A growing number of real-world incidents have shown that adversaries can easily manipulate agents into performing harmful actions, such as leaking API keys, deleting user data, or initiating unauthorized transactions. Evaluating agent security is inherently challenging, as agents operate in dynamic, untrusted environments involving external tools, heterogeneous data sources, and frequent user interactions. However, realistic, controllable, and reproducible environments for large-scale risk assessment remain largely underexplored. To address this gap, we introduce the DecodingTrust-Agent Platform (DTap), the first controllable and interactive red-teaming platform for AI agents, spanning 14 real-world domains and over 50 simulation environments that replicate widely used systems such as Google Workspace, Paypal, and Slack. To scale the risk assessment of agents in DTap, we further propose DTap-Red, the first autonomous red-teaming agent that systematically explores diverse injection vectors (e.g., prompt, tool, skill, environment, combinations) and autonomously discovers effective attack strategies tailored to varying malicious goals. Using DTap-Red, we curate DTap-Bench, a large-scale red-teaming dataset comprising high-quality instances across domains, each paired with a verifiable judge to automatically validate attack outcomes. Through DTap, we conduct large-scale evaluations of popular AI agents built on various backbone models, spanning security policies, risk categories, and attack strategies, revealing systematic vulnerability patterns and providing valuable insights for developing secure next-generation agents.

0 Citations

0 Influential

12.5 Altmetric

62.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!