2601.13518v2 Jan 20, 2026 cs.AI

AgenticRed: Optimizing Agentic Systems for Automated Red-teaming

Jiayi Yuan

Citations: 2

h-index: 1

Natasha Jaques

Citations: 5,408

h-index: 32

Jonathan Nöther

Citations: 0

h-index: 0

Goran Radanović

Citations: 43

h-index: 3

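Recent automated red-teaming methods are promising for systematically identifying model vulnerabilities, but most existing approaches rely on human-designed workflows. These manually designed workflows suffer from human bias and make it costly to explore the broader design space. This paper introduces AgenticRed, an automated pipeline that uses LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within a predefined structure, AgenticRed treats red-teaming as a system design problem. Inspired by methods such as Meta Agent Search, we develop a new procedure that evolves agentic systems through evolutionary selection and apply it to automated red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving a 96% attack success rate (ASR) on Llama-2-7B (a 36% improvement) and 98% ASR on Llama-3-8B. The approach also generalizes strongly to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o and 60% ASR on Claude-Sonnet-3.5 (a 24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.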

Original Abstract

While recent automated red-teaming methods show promise for systematically exposing model vulnerabilities, most existing approaches rely on human-specified workflows. This dependence on manually designed workflows suffers from human biases and makes exploring the broader design space expensive. We introduce AgenticRed, an automated pipeline that leverages LLMs' in-context learning to iteratively design and refine red-teaming systems without human intervention. Rather than optimizing attacker policies within predefined structures, AgenticRed treats red-teaming as a system design problem. Inspired by methods like Meta Agent Search, we develop a novel procedure for evolving agentic systems using evolutionary selection, and apply it to the problem of automatic red-teaming. Red-teaming systems designed by AgenticRed consistently outperform state-of-the-art approaches, achieving 96% attack success rate (ASR) on Llama-2-7B (36% improvement) and 98% on Llama-3-8B on HarmBench. Our approach exhibits strong transferability to proprietary models, achieving 100% ASR on GPT-3.5-Turbo and GPT-4o, and 60% on Claude-Sonnet-3.5 (24% improvement). This work highlights automated system design as a powerful paradigm for AI safety evaluation that can keep pace with rapidly evolving models.
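The abstract names two ideas that a small sketch can make concrete: attack success rate (ASR) as an evaluation metric, and evolutionary selection over candidate systems. The Python below is a generic illustration only, not the AgenticRed procedure. The names `attack_success_rate`, `evolve`, `propose`, and `score`, and the fixed-size archive, are assumptions introduced here. The toy run replaces real systems with plain numbers so the loop can execute on its own.

```python
import random
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts judged successful (ASR), in [0, 1]."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def evolve(
    seeds: List[T],
    propose: Callable[[List[T]], T],
    score: Callable[[T], float],
    generations: int,
    archive_size: int,
) -> List[T]:
    """Generic evolutionary selection: propose, score, keep the top-k.

    `propose` is a hypothetical hook; in an LLM-driven search it could be a
    model call conditioned on the current archive. Here it is any callable.
    """
    archive = sorted(seeds, key=score, reverse=True)[:archive_size]
    for _ in range(generations):
        candidate = propose(archive)
        archive.append(candidate)
        # Selection step: only the highest-scoring candidates survive.
        archive = sorted(archive, key=score, reverse=True)[:archive_size]
    return archive


# Toy stand-ins so the sketch runs: a "system" is just a number and its
# score is its closeness to an arbitrary target value.
rng = random.Random(0)


def toy_score(x: float) -> float:
    return -abs(x - 0.96)


def toy_propose(archive: List[float]) -> float:
    # Mutate the current best candidate by a small random step.
    return archive[0] + rng.uniform(-0.1, 0.1)


best = evolve([0.1, 0.5], toy_propose, toy_score, generations=200, archive_size=3)
```

Because selection always keeps the top-scoring candidates, the best score in the archive never decreases across generations. In a real evaluation, `score` would be an ASR measured on a benchmark, which is far more expensive to compute than this toy function.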

0 Citations

0 Influential

16 Altmetric

80.0 Score
