2602.02395v1 Feb 02, 2026 cs.LG

David vs. Goliath: Verifiable Agent-to-Agent Jailbreaking via Reinforcement Learning

Original Abstract

The evolution of large language models into autonomous agents introduces adversarial failures that exploit legitimate tool privileges, transforming safety evaluation in tool-augmented environments from a subjective NLP task into an objective control problem. We formalize this threat model as Tag-Along Attacks: a scenario where a tool-less adversary "tags along" on the trusted privileges of a safety-aligned Operator to induce prohibited tool use through conversation alone. To validate this threat, we present Slingshot, a 'cold-start' reinforcement learning framework that autonomously discovers emergent attack vectors, revealing a critical insight: in our setting, learned attacks tend to converge to short, instruction-like syntactic patterns rather than multi-turn persuasion. On held-out extreme-difficulty tasks, Slingshot achieves a 67.0% success rate against a Qwen2.5-32B-Instruct-AWQ Operator (vs. 1.7% baseline), reducing the expected attempts to first success (on solved tasks) from 52.3 to 1.3. Crucially, Slingshot transfers zero-shot to several model families, including closed-source models like Gemini 2.5 Flash (56.0% attack success rate) and defensive-fine-tuned open-source models like Meta-SecAlign-8B (39.2% attack success rate). Our work establishes Tag-Along Attacks as a first-class, verifiable threat model and shows that effective agentic attacks can be elicited from off-the-shelf open-weight models through environment interaction alone.
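The abstract treats attack success as an objective outcome (whether the Operator made a prohibited tool call) and reports both a success rate and an expected number of attempts to first success on solved tasks. The following is a minimal sketch of how such quantities could be computed, not the authors' evaluation code. The `ToolCall` schema, the function names, and the tool names are hypothetical, and the attempts metric assumes independent attempts per task, so the expected attempts to first success is 1/p for a per-task success probability p, averaged over tasks with at least one success. The abstract does not state the exact definition the paper uses.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation in the Operator's trace (hypothetical schema)."""
    name: str
    args: dict


def attack_succeeded(trace, prohibited_tool):
    """Objective check: did the Operator invoke the prohibited tool at all?"""
    return any(call.name == prohibited_tool for call in trace)


def success_rate(outcomes):
    """Fraction of attempts (booleans) that succeeded."""
    return sum(outcomes) / len(outcomes)


def expected_attempts_on_solved(per_task_outcomes):
    """Mean of 1/p over tasks with at least one success.

    Assumption: attempts on a task are independent with success
    probability p, so attempts to first success is geometric with mean 1/p.
    """
    rates = [success_rate(outcomes) for outcomes in per_task_outcomes]
    solved = [p for p in rates if p > 0]
    if not solved:
        raise ValueError("no task was solved")
    return mean(1.0 / p for p in solved)


# Toy data only; none of these values come from the paper.
trace = [
    ToolCall("read_inbox", {}),
    ToolCall("transfer_funds", {"amount": 100}),
]
per_task = [
    [True, False, True, True],      # p = 0.75
    [False, False, False, False],   # unsolved, excluded from the attempts metric
    [True, True, True, True],       # p = 1.0
]

verdict = attack_succeeded(trace, "transfer_funds")
overall = success_rate([o for task in per_task for o in task])
attempts = expected_attempts_on_solved(per_task)
```

Under this assumption, inverting the aggregate rates in the abstract gives roughly 1/0.017 ≈ 58.8 and 1/0.670 ≈ 1.5, which differ from the reported 52.3 and 1.3. That gap is consistent with the reported figures being conditioned on solved tasks, which is why the sketch averages per task instead of inverting the overall rate.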
