2602.05746v1 Feb 05, 2026 cs.LG

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

F. Tramèr (Citations: 1,509; h-index: 16)

Xin Chen (Citations: 6; h-index: 1)

Jie Zhang (Citations: 519; h-index: 10)

Original Abstract

Prompt injection is one of the most critical vulnerabilities in LLM agents; yet, effective automated attacks remain largely unexplored from an optimization perspective. Existing methods heavily depend on human red-teamers and hand-crafted prompts, limiting their scalability and adaptability. We propose AutoInject, a reinforcement learning framework that generates universal, transferable adversarial suffixes while jointly optimizing for attack success and utility preservation on benign tasks. Our black-box method supports both query-based optimization and transfer attacks to unseen models and tasks. Using only a 1.5B parameter adversarial suffix generator, we successfully compromise frontier systems including GPT 5 Nano, Claude Sonnet 3.5, and Gemini 2.5 Flash on the AgentDojo benchmark, establishing a stronger baseline for automated prompt injection research.
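The abstract describes jointly optimizing for attack success and utility preservation but does not give the reward itself. As a minimal sketch of what such a joint objective could look like, the Python below combines two binary outcomes into one scalar with a weighted sum. The `EpisodeOutcome` type, the `joint_reward` function, and the `utility_weight` parameter are all hypothetical names introduced here for illustration; the paper's actual reward may differ.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Outcome of one evaluated episode (hypothetical structure, not from the paper)."""
    attack_succeeded: bool   # the injected goal was carried out
    utility_preserved: bool  # the benign user task was still completed


def joint_reward(outcome: EpisodeOutcome, utility_weight: float = 0.5) -> float:
    """Weighted sum of the two objectives.

    The weighted-sum form and the default weight are assumptions made for
    this sketch; they are not taken from the paper.
    """
    if not 0.0 <= utility_weight <= 1.0:
        raise ValueError("utility_weight must lie in [0, 1]")
    attack_term = 1.0 if outcome.attack_succeeded else 0.0
    utility_term = 1.0 if outcome.utility_preserved else 0.0
    return (1.0 - utility_weight) * attack_term + utility_weight * utility_term


def mean_reward(outcomes: list[EpisodeOutcome], utility_weight: float = 0.5) -> float:
    """Average the joint reward over a batch of episodes."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    total = sum(joint_reward(o, utility_weight) for o in outcomes)
    return total / len(outcomes)
```

Under this sketch, an episode scores highest only when both objectives are met, and `utility_weight` controls how much a failed benign task is penalized relative to a failed attack.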
