2601.18491v1 Jan 26, 2026 cs.AI

AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

Dongrui Liu (Citations: 374; h-index: 8)
Qihan Ren (Citations: 382; h-index: 7)
Chen Qian (Citations: 17; h-index: 2)
Shuai Shao (Citations: 71; h-index: 4)
Yuejin Xie (Citations: 48; h-index: 4)
Yu Li (Citations: 17; h-index: 1)
Zhonghao Yang (Citations: 27; h-index: 2)
Haoyu Luo (Citations: 28; h-index: 3)
Peng Wang (Citations: 29; h-index: 3)
Qingyu Liu (Citations: 150; h-index: 3)
Ling Tang (Citations: 33; h-index: 3)
Jilin Mei (Citations: 22; h-index: 2)
Dadi Guo (Citations: 21; h-index: 3)
Lei Yuan (Citations: 34; h-index: 2)
Junyao Yang (Citations: 66; h-index: 3)
Guanxu Chen (Citations: 90; h-index: 6)
Qihao Lin (Citations: 28; h-index: 2)
Yi Yu (Citations: 48; h-index: 4)
Bo Zhang (Citations: 44; h-index: 3)
Jiaxuan Guo (Citations: 377; h-index: 10)
Jie Zhang (Citations: 20; h-index: 2)
Wenqi Shao (Citations: 19; h-index: 2)
Huiqi Deng, Sun Yat-Sen University (Citations: 1,365; h-index: 14)
Zhiheng Xi (Citations: 15; h-index: 1)
Wenxuan Wang (Citations: 57; h-index: 4)
Wen Shen (Citations: 17; h-index: 1)
Zhikai Chen (Citations: 70; h-index: 5)
Jialing Tao (Citations: 170; h-index: 7)
Juntao Dai (Citations: 44; h-index: 4)
Jiaming Ji (Citations: 45; h-index: 3)
Linfeng Zhang (Citations: 167; h-index: 7)
Quanshi Zhang (Citations: 160; h-index: 7)
Lei Zhu (Citations: 25; h-index: 3)
Zhihua Wei (Citations: 18; h-index: 2)
Hui Xue (Citations: 131; h-index: 5)
Chaochao Lu (Citations: 59; h-index: 4)
Jing Shao (Citations: 74; h-index: 3)
Xia Hu (Citations: 16; h-index: 1)
Wenjie Wang (Citations: 31; h-index: 3)
Yong Liu (Citations: 15; h-index: 1)
Bin Hu (Citations: 287; h-index: 6)
Zhongjie Ba (Citations: 1,707; h-index: 20)
Haoyu Xie (Citations: 26; h-index: 3)

Abstract

The rise of AI agents introduces complex safety and security challenges arising from autonomous tool use and environmental interactions. Current guardrail models lack agentic risk awareness and transparency in risk diagnosis. To introduce an agentic guardrail that covers complex and numerous risky behaviors, we first propose a unified three-dimensional taxonomy that orthogonally categorizes agentic risks by their source (where), failure mode (how), and consequence (what). Guided by this structured and hierarchical taxonomy, we introduce a new fine-grained agentic safety benchmark (ATBench) and a Diagnostic Guardrail framework for agent safety and security (AgentDoG). AgentDoG provides fine-grained and contextual monitoring across agent trajectories. More crucially, AgentDoG can diagnose the root causes of both unsafe actions and seemingly safe but unreasonable actions, offering provenance and transparency beyond binary labels to facilitate effective agent alignment. AgentDoG variants are available in three sizes (4B, 7B, and 8B parameters) across the Qwen and Llama model families. Extensive experimental results demonstrate that AgentDoG achieves state-of-the-art performance in agentic safety moderation across diverse and complex interactive scenarios. All models and datasets are openly released.
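To make the idea of a three-axis diagnosis concrete, here is a minimal Python sketch of what a verdict that goes "beyond binary labels" might look like as a data structure. It is an illustration only, not AgentDoG's actual output schema: the class names `RiskDiagnosis` and `Verdict`, the `explain` method, and the example category strings are all hypothetical, and only the three axis names (source, failure mode, consequence) come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RiskDiagnosis:
    """One point in a three-axis risk taxonomy (hypothetical structure)."""
    source: str        # where the risk originates
    failure_mode: str  # how the agent goes wrong
    consequence: str   # what harm results


@dataclass(frozen=True)
class Verdict:
    """A guardrail result: a binary label plus an optional diagnosis."""
    safe: bool
    diagnosis: Optional[RiskDiagnosis] = None

    def explain(self) -> str:
        # A diagnosis may accompany either label, since the abstract also
        # mentions diagnosing seemingly safe but unreasonable actions.
        label = "safe" if self.safe else "unsafe"
        if self.diagnosis is None:
            return label
        d = self.diagnosis
        return (f"{label}: source={d.source}; "
                f"failure_mode={d.failure_mode}; "
                f"consequence={d.consequence}")


# Category values below are invented for illustration, not taken from the paper.
verdict = Verdict(
    safe=False,
    diagnosis=RiskDiagnosis(
        source="tool_output",
        failure_mode="followed_injected_instruction",
        consequence="data_leak",
    ),
)
print(verdict.explain())
```

Keeping the three axes as separate fields mirrors the abstract's claim that they are orthogonal: any source could in principle combine with any failure mode and any consequence.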

15 Citations

2 Influential

10 Altmetric

69.0 Score
