2603.24511v1 Mar 25, 2026 cs.LG

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Maksym Andriushchenko (Citations: 110, h-index: 5)

Y. Montjoye (Citations: 150, h-index: 6)

Alexander Panfilov, Max Planck Institute for Intelligent Systems (Citations: 118, h-index: 6)

Peter Romov (Citations: 270, h-index: 6)

Igor Shilov (Citations: 153, h-index: 6)

Jonas Geiping (Citations: 1,189, h-index: 15)

LLM agents such as Claude Code can be used not only to write code but also for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} built on Claude Code discovers new white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform} more than 30 existing methods in jailbreaking and prompt injection evaluations. Starting from existing attack implementations (e.g., GCG \citep{zou2023universal}), the agent produces new algorithms that reach up to 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared with $\leq$10\% for existing algorithms (\Cref{fig:teaser}, left). The discovered algorithms generalize well: attacks optimized on surrogate models transfer directly to held-out models and achieve a \textbf{100\% attack success rate against Meta-SecAlign-70B} \citep{chen2025secalign}, versus 56\% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of \cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated with LLM agents. White-box adversarial red-teaming is particularly well suited to such automation: existing methods provide strong starting points, and the optimization objective provides quantitative feedback. We release all discovered attacks, together with baseline implementations and evaluation code, at https://github.com/romovpa/claudini.

Original Abstract

LLM agents like Claude Code can not only write code but also be used for autonomous AI research and engineering \citep{rank2026posttrainbench, novikov2025alphaevolve}. We show that an \emph{autoresearch}-style pipeline \citep{karpathy2026autoresearch} powered by Claude Code discovers novel white-box adversarial attack \textit{algorithms} that \textbf{significantly outperform all existing (30+) methods} in jailbreaking and prompt injection evaluations. Starting from existing attack implementations, such as GCG~\citep{zou2023universal}, the agent iterates to produce new algorithms achieving up to 40\% attack success rate on CBRN queries against GPT-OSS-Safeguard-20B, compared to $\leq$10\% for existing algorithms (\Cref{fig:teaser}, left). The discovered algorithms generalize: attacks optimized on surrogate models transfer directly to held-out models, achieving \textbf{100\% ASR against Meta-SecAlign-70B} \citep{chen2025secalign} versus 56\% for the best baseline (\Cref{fig:teaser}, middle). Extending the findings of~\cite{carlini2025autoadvexbench}, our results are an early demonstration that incremental safety and security research can be automated using LLM agents. White-box adversarial red-teaming is particularly well-suited for this: existing methods provide strong starting points, and the optimization objective yields dense, quantitative feedback. We release all discovered attacks alongside baseline implementations and evaluation code at https://github.com/romovpa/claudini.

6 Citations

1 Influential

27.5 Altmetric

145.5 Score
