2604.01438v2 Apr 01, 2026 cs.AI

ClawSafety: "Safe" LLMs, Unsafe Agents

Jinhao Pan

Bowen Wei

George Mason University

Yunbei Zhang

Kai Mei

Jihun Hamm

Yingqiang Ge

Xiao Wang

Ziwei Zhu

Original Abstract

Personal AI agents like OpenClaw run with elevated privileges on users' local machines, where a single successful prompt injection can leak credentials, redirect financial transactions, or destroy files. This threat goes well beyond conventional text-level jailbreaks, yet existing safety evaluations fall short: most test models in isolated chat settings, rely on synthetic environments, and do not account for how the agent framework itself shapes safety outcomes. We introduce CLAWSAFETY, a benchmark of 120 adversarial test scenarios organized along three dimensions (harm domain, attack vector, and harmful action type) and grounded in realistic, high-privilege professional workspaces spanning software engineering, finance, healthcare, law, and DevOps. Each test case embeds adversarial content in one of three channels the agent encounters during normal work: workspace skill files, emails from trusted senders, and web pages. We evaluate five frontier LLMs as agent backbones, running 2,520 sandboxed trials across all configurations. Attack success rates (ASR) range from 40% to 75% across models and vary sharply by injection vector, with skill instructions (highest trust) consistently more dangerous than email or web content. Action-trace analysis reveals that the strongest model maintains hard boundaries against credential forwarding and destructive actions, while weaker models permit both. Cross-scaffold experiments on three agent frameworks further demonstrate that safety is not determined by the backbone model alone but depends on the full deployment stack, calling for safety evaluation that treats model and framework as joint variables. Code and data will be available at: https://weibowen555.github.io/ClawSafety/.
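The abstract reports attack success rates (ASR) broken down by backbone model and by injection vector. A minimal sketch of how such a breakdown could be computed from per-trial outcomes follows. The `Trial` record, its field names, the `attack_success_rate` helper, and the toy data are all hypothetical illustrations; they are not the benchmark's actual schema, code, or results.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One sandboxed trial (hypothetical record layout, not the paper's schema)."""
    model: str       # placeholder backbone label
    vector: str      # injection channel: "skill", "email", or "web"
    succeeded: bool  # True if the agent carried out the injected harmful action


def attack_success_rate(trials, key):
    """Return {group: fraction of successful attacks}, grouping by key(trial)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for trial in trials:
        group = key(trial)
        totals[group] += 1
        hits[group] += int(trial.succeeded)
    return {group: hits[group] / totals[group] for group in totals}


# Made-up toy records for illustration only; NOT results from the paper.
trials = [
    Trial("model-a", "skill", True),
    Trial("model-a", "skill", True),
    Trial("model-a", "email", False),
    Trial("model-a", "web", True),
    Trial("model-b", "skill", True),
    Trial("model-b", "email", False),
    Trial("model-b", "web", False),
    Trial("model-b", "web", False),
]

asr_by_model = attack_success_rate(trials, key=lambda t: t.model)
asr_by_vector = attack_success_rate(trials, key=lambda t: t.vector)
asr_by_pair = attack_success_rate(trials, key=lambda t: (t.model, t.vector))
```

Passing the grouping as a `key` function lets the same helper be reused for a per-framework breakdown, which is the kind of joint model-and-scaffold comparison the abstract argues for.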
