2604.24020v1 Apr 27, 2026 cs.CR

Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Yangbin Yu

Jiaqi Li

Lidong Zhai

Yangyang Zhao

Binxue Sun

Jian Chang

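Autonomous AI agents deployed on platforms such as OpenClaw are vulnerable to prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences focus only on the platform perimeter and do nothing to train the agent's own threat judgement. This paper proposes ClawdGo, a framework for endogenous security awareness training that improves the agent's ability to recognise and reason about threats by itself at inference time, with no model modification.

The paper makes four contributions. First, TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across three layers: Self-Defence, Owner-Protection, and Enterprise-Security. Second, ASAT (Autonomous Security Awareness Training) is a self-play loop in which the agent alternates between attacker, defender, and evaluator roles under weakest-first curriculum scheduling, so the weakest dimension is trained first. Third, CSMA (Cross-Session Memory Accumulation) accumulates training gains through a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP). Fourth, SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training.

In live experiments, weakest-first ASAT raised the average TLDT score from 80.9 to 96.9 over 16 sessions, 6.5 points higher than uniform-random scheduling, and covered 11 of the 12 dimensions. CSMA retains the gain across sessions, whereas the cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP was observed when a heavily trained agent misclassified a legitimate capability assessment as prompt injection (30 of 160 trials).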

Original Abstract

Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent's own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).
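The figures reported in the abstract are mutually consistent, which a few lines of arithmetic can confirm. The sketch below uses only numbers stated in the abstract; the function name `summarize_reported_numbers` is hypothetical, and reading 30/160 as a fraction of trials is an interpretation, not something the abstract spells out.

```python
def summarize_reported_numbers():
    """Recompute derived quantities from the figures quoted in the abstract."""
    baseline_score = 80.9        # average TLDT score before training
    trained_score = 96.9         # average TLDT score after 16 sessions
    cold_start_recovered = 2.4   # points recovered in the cold-start ablation

    # Total gain from weakest-first ASAT.
    gain = round(trained_score - baseline_score, 1)

    # Gap left when cross-session memory is not available (cold start).
    gap = round(gain - cold_start_recovered, 1)

    # Assumption: 30/160 means 30 misclassifications out of 160 trials.
    misclassification_rate = 30 / 160

    return {
        "gain": gain,
        "cold_start_gap": gap,
        "misclassification_rate": misclassification_rate,
    }


if __name__ == "__main__":
    for name, value in summarize_reported_numbers().items():
        print(f"{name}: {value}")
```

The recomputed gap matches the 13.6 points quoted in the abstract, and under the stated assumption the legitimate capability assessment was flagged as prompt injection in 18.75% of trials.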
