2605.04785v1 May 06, 2026 cs.AI

AgentTrust: Runtime Safety Evaluation and Interception for AI Agent Tool Use

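Modern AI agents affect the real world through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, such as an accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior only after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes restrict where code runs without understanding what an action means. We propose AgentTrust, a runtime safety layer that intercepts an agent's tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection of multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a benchmark of 300 scenarios across six risk categories, plus 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, which was evaluated under a patched ruleset and is not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.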

Original Abstract

Modern AI agents execute real-world side effects through tool calls such as file operations, shell commands, HTTP requests, and database queries. A single unsafe action, including accidental deletion, credential exposure, or data exfiltration, can cause irreversible harm. Existing defenses are incomplete: post-hoc benchmarks measure behavior after execution, static guardrails miss obfuscation and multi-step context, and infrastructure sandboxes constrain where code runs without understanding what an action means. We present AgentTrust, a runtime safety layer that intercepts agent tool calls before execution and returns a structured verdict: allow, warn, block, or review. AgentTrust combines a shell deobfuscation normalizer, SafeFix suggestions for safer alternatives, RiskChain detection for multi-step attack chains, and a cache-aware LLM-as-Judge for ambiguous inputs. We release a 300-scenario benchmark across six risk categories and an additional 630 independently constructed real-world adversarial scenarios. On the internal benchmark, the production-only ruleset achieves 95.0% verdict accuracy and 73.7% risk-level accuracy at low-millisecond end-to-end latency. On the 630-scenario benchmark, evaluated under a patched ruleset and not claimed as zero-shot, AgentTrust achieves 96.7% verdict accuracy, including about 93% on shell-obfuscated payloads. AgentTrust is released under the AGPL-3.0 license and provides a Model Context Protocol server for MCP-compatible agents.
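The interception flow described in the abstract can be illustrated with a small sketch. This is not AgentTrust's actual API or ruleset: the names `ToolCall`, `normalize`, `evaluate`, and `guarded_execute`, the three regex rules, and the choice to return `review` for decoded-but-unmatched commands are all assumptions made for illustration. It only shows the general shape of the idea: normalize an obfuscated shell command, match it against rules, and return one of the four verdicts before anything runs.

```python
import base64
import binascii
import re
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional, Tuple


class Verdict(Enum):
    ALLOW = "allow"
    WARN = "warn"
    BLOCK = "block"
    REVIEW = "review"


@dataclass(frozen=True)
class ToolCall:
    tool: str      # e.g. "shell", "file", "http" (hypothetical tool names)
    argument: str  # raw command, path, or URL


# Hypothetical rules, checked in order. Not the paper's ruleset.
RULES = [
    (re.compile(r"\brm\s+-\w*r\w*\s+/(?:\s|$)"), Verdict.BLOCK),           # recursive delete of /
    (re.compile(r"\b(?:curl|wget)\b.*\|\s*(?:sh|bash)\b"), Verdict.BLOCK),  # pipe download to shell
    (re.compile(r"\.env\b|id_rsa|\.aws/credentials"), Verdict.WARN),        # credential files
]

# Matches `echo <base64> | base64 -d` so the hidden payload can be inlined.
_B64_PIPE = re.compile(
    r"echo\s+['\"]?([A-Za-z0-9+/=]+)['\"]?\s*\|\s*base64\s+(?:-d|--decode)"
)


def normalize(command: str) -> str:
    """Replace base64-decoding pipelines with the text they would produce."""
    def _decode(match: "re.Match[str]") -> str:
        try:
            return base64.b64decode(match.group(1), validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            return match.group(0)  # leave undecodable input untouched
    return _B64_PIPE.sub(_decode, command)


def evaluate(call: ToolCall) -> Verdict:
    """Return a verdict for a tool call without executing it."""
    text = normalize(call.argument) if call.tool == "shell" else call.argument
    for pattern, verdict in RULES:
        if pattern.search(text):
            return verdict
    if text != call.argument:
        # Obfuscated but unmatched: treat as ambiguous (assumed policy).
        return Verdict.REVIEW
    return Verdict.ALLOW


def guarded_execute(
    call: ToolCall, executor: Callable[[ToolCall], str]
) -> Tuple[Verdict, Optional[str]]:
    """Run the executor only when the verdict is allow or warn."""
    verdict = evaluate(call)
    if verdict in (Verdict.BLOCK, Verdict.REVIEW):
        return verdict, None
    return verdict, executor(call)


if __name__ == "__main__":
    hidden = base64.b64encode(b"rm -rf /").decode()
    for cmd in ("ls -la", "cat .env", f"echo {hidden} | base64 -d | sh"):
        print(cmd, "->", evaluate(ToolCall("shell", cmd)).value)
```

In this sketch the `review` verdict stands in for the point where a real system might defer to a human or to an LLM-based judge; SafeFix suggestions and RiskChain tracking across multiple calls are not modeled here.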
