2603.05786v1 Mar 06, 2026 cs.CR

Proof-of-Guardrail in AI Agents and What (Not) to Trust from It

Michael Duan (Citations: 221, h-index: 3)
Xisen Jin (Citations: 493, h-index: 3)
Qin Lin (Citations: 3, h-index: 1)
Aaron Chan (Citations: 26, h-index: 2)
Zhenglun Chen (Citations: 20, h-index: 3)
Junyi Du, University of Southern California (Citations: 572, h-index: 11)
Xiang Ren (Citations: 16, h-index: 3)

Abstract

As AI agents become widely deployed as online services, users often rely on an agent developer's claim about how safety is enforced, which introduces a threat where safety measures are falsely advertised. To address the threat, we propose proof-of-guardrail, a system that enables developers to provide cryptographic proof that a response is generated after a specific open-source guardrail. To generate proof, the developer runs the agent and guardrail inside a Trusted Execution Environment (TEE), which produces a TEE-signed attestation of guardrail code execution verifiable by any user offline. We implement proof-of-guardrail for OpenClaw agents and evaluate latency overhead and deployment cost. Proof-of-guardrail ensures integrity of guardrail execution while keeping the developer's agent private, but we also highlight a risk of deception about safety, for example, when malicious developers actively jailbreak the guardrail. Code and demo video: https://github.com/SaharaLabsAI/Verifiable-ClawGuard
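
The offline check described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the JSON claim layout, and the use of an HMAC with a shared key are all assumptions made for illustration. A real TEE signs its attestation with a hardware-rooted asymmetric key and uses a vendor-specific document format. The sketch only shows the two things a verifier would plausibly check: that the signature over the claims is valid, and that the attested code measurement matches the hash of the published open-source guardrail.

```python
import hashlib
import hmac
import json


def measure(code: bytes) -> str:
    """Hypothetical code measurement: SHA-256 of the guardrail source."""
    return hashlib.sha256(code).hexdigest()


def sign_attestation(key: bytes, measurement: str, response: bytes) -> dict:
    """Stand-in for the TEE: binds the code measurement to one response.

    Assumption: HMAC replaces the hardware-rooted asymmetric signature
    that a real TEE would produce.
    """
    claims = {
        "measurement": measurement,
        "response_sha256": hashlib.sha256(response).hexdigest(),
    }
    payload = json.dumps(claims, sort_keys=True)
    signature = hmac.new(key, payload.encode(), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_attestation(
    key: bytes, attestation: dict, guardrail_code: bytes, response: bytes
) -> bool:
    """Offline check: valid signature, expected guardrail, matching response."""
    payload = attestation["payload"].encode()
    expected_sig = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected_sig, attestation["signature"]):
        return False  # the claims were not signed by the (simulated) TEE
    claims = json.loads(payload)
    same_code = claims["measurement"] == measure(guardrail_code)
    same_response = (
        claims["response_sha256"] == hashlib.sha256(response).hexdigest()
    )
    return same_code and same_response


if __name__ == "__main__":
    key = b"simulated-tee-key"  # hypothetical; real keys never leave the TEE
    guardrail = b"def guardrail(text): return text"  # placeholder source
    reply = b"agent response"
    proof = sign_attestation(key, measure(guardrail), reply)
    print(verify_attestation(key, proof, guardrail, reply))
    print(verify_attestation(key, proof, b"modified guardrail", reply))
```

A passing check in this sketch says only that the named guardrail code ran and that the response is the one it was bound to. As the abstract cautions, it does not by itself show that the response is safe, for example when a developer actively jailbreaks the guardrail.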
