2603.20925v1 Mar 21, 2026 cs.AI

Profit is the Red Team: Stress-Testing Agents in Strategic Economic Interactions

Samuele Marro (Citations: 116, h-index: 5)

Shouqiao Wang (Citations: 19, h-index: 2)

Marcello Politi (Citations: 18, h-index: 2)

Davide Crapis (Citations: 135, h-index: 5)

Abstract

As agentic systems move into real-world deployments, their decisions increasingly depend on external inputs such as retrieved content, tool outputs, and information provided by other actors. When these inputs can be strategically shaped by adversaries, the relevant security risk extends beyond a fixed library of prompt attacks to adaptive strategies that steer agents toward unfavorable outcomes. We propose profit-driven red teaming, a stress-testing protocol that replaces handcrafted attacks with a learned opponent trained to maximize its profit using only scalar outcome feedback. The protocol requires no LLM-as-judge scoring, attack labels, or attack taxonomy, and is designed for structured settings with auditable outcomes. We instantiate it in a lean arena of four canonical economic interactions, which provide a controlled testbed for adaptive exploitability. In controlled experiments, agents that appear strong against static baselines become consistently exploitable under profit-optimized pressure, and the learned opponent discovers probing, anchoring, and deceptive commitments without explicit instruction. We then distill exploit episodes into concise prompt rules for the agent, which make most previously observed failures ineffective and substantially improve target performance. These results suggest that profit-driven red-team data can provide a practical route to improving robustness in structured agent settings with auditable outcomes.
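
The core idea of the protocol, an opponent that learns only from its own scalar profit, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the paper: the ultimatum-style split of a fixed pie, the threshold-based target (`target_accepts`), and the epsilon-greedy bandit used as the learner. The abstract does not specify the actual environments or training method, so this should be read as a minimal stand-in rather than the authors' implementation.

```python
import random

PIE = 100  # total surplus to split (hypothetical toy value)


def target_accepts(offer: int, threshold: int = 30) -> bool:
    """Hypothetical static target: accepts any offer at or above a hidden threshold."""
    return offer >= threshold


def play_episode(offer: int) -> float:
    """Return the adversary's profit, the only signal it ever observes."""
    return float(PIE - offer) if target_accepts(offer) else 0.0


def train_adversary(offers, episodes=500, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit over candidate offers, driven by scalar profit alone."""
    rng = random.Random(seed)
    counts = {o: 0 for o in offers}
    mean_profit = {o: 0.0 for o in offers}
    for t in range(episodes):
        if t < len(offers):
            offer = offers[t]  # try every candidate offer once
        elif rng.random() < epsilon:
            offer = rng.choice(offers)  # explore
        else:
            offer = max(offers, key=lambda o: mean_profit[o])  # exploit
        profit = play_episode(offer)
        counts[offer] += 1
        # Incremental running mean of observed profit for this offer.
        mean_profit[offer] += (profit - mean_profit[offer]) / counts[offer]
    return mean_profit


if __name__ == "__main__":
    offers = list(range(0, 60, 10))
    estimates = train_adversary(offers)
    best = max(offers, key=lambda o: estimates[o])
    print(best, estimates[best])
```

In this toy setting the adversary receives no attack labels and no judge score; it locates the target's acceptance threshold purely by probing and then settles on the lowest offer that is still accepted. This loosely mirrors the probing behavior the abstract reports, in a far simpler form.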

0 Citations

0 Influential

2.5 Altmetric

12.5 Score
