2603.17639v1 Mar 18, 2026 cs.AI

VeriGrey: 그레이박스 기반 에이전트 검증

VeriGrey: Greybox Agent Validation

Ruijie Meng

Citations: 474

h-index: 9

S. Kang

Citations: 0

h-index: 0

Abhik Roychoudhury

Citations: 2,337

h-index: 18

Yuntong Zhang

Citations: 434

h-index: 6

Marcel Böhme

Citations: 11

h-index: 2

최근 에이전트 AI는 큰 관심을 받고 있는 분야입니다. Large Language Model (LLM) 에이전트는 백엔드에 하나 이상의 LLM을 포함하며, 프론트엔드에서는 LLM의 출력과 여러 외부 도구의 결과를 결합하여 자율적인 의사 결정을 수행합니다. 이러한 외부 환경과의 자율적인 상호 작용은 중요한 보안 위험을 초래합니다. 본 논문에서는 LLM 에이전트의 다양한 동작을 탐색하고 보안 위험을 발견하기 위한 그레이박스 접근 방식을 제시합니다. 저희가 제안하는 VeriGrey는 테스트 프로세스를 제어하기 위해 호출되는 도구의 시퀀스를 피드백 함수로 사용합니다. 이를 통해 예상치 못한 에이전트 동작을 유발하는 드물지만 위험한 도구 호출을 발견하는 데 도움이 됩니다. 테스트 프로세스에서 돌연변이 연산자로, 저희는 악의적인 삽입 프롬프트를 설계하기 위해 프롬프트를 수정합니다. 이 작업은 에이전트의 기능을 완료하는 데 필요한 단계가 되도록 삽입 작업을 에이전트의 작업과 연결하여 신중하게 수행됩니다. 저희 접근 방식인 VeriGrey를 잘 알려진 AgentDojo 벤치마크에서 블랙박스 기반 접근 방식과 비교한 결과, GPT-4.1 백엔드를 사용하는 경우 간접적인 프롬프트 주입 취약점을 발견하는 데 33% 더 높은 효율성을 달성했습니다. 또한, 널리 사용되는 코딩 에이전트인 Gemini CLI 및 잘 알려진 개인 비서인 OpenClaw에 대한 실제 사례 연구를 수행했습니다. VeriGrey는 블랙박스 접근 방식으로 식별할 수 없었던 여러 공격 시나리오를 유발하는 프롬프트를 발견했습니다. OpenClaw에서, 필요에 따라 돌연변이 퍼징 테스트를 사용하는 대화형 에이전트를 구축함으로써, VeriGrey는 10개의 악성 스킬 변종을 발견했습니다 (Kimi-K2.5 LLM 백엔드에서 10/10 = 100% 성공률, Opus 4.6 LLM 백엔드에서 9/10 = 90% 성공률). 이는 VeriGrey와 같은 동적 접근 방식이 에이전트를 테스트하는 데 얼마나 유용한지, 그리고 궁극적으로 에이전트 보안 프레임워크로 이어질 수 있는지 보여줍니다.

Original Abstract

Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi-K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.

0 Citations

0 Influential

9 Altmetric

45.0 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!