2604.07223v1 Apr 08, 2026 cs.CR


TraceSafe: A Systematic Assessment of LLM Guardrails on Multi-Step Tool-Calling Trajectories

Cheng Yang


Yen-Shan Chen


Yun-Nung Chen


Sian-Yao Huang



Original Abstract

As large language models (LLMs) evolve from static chatbots into autonomous agents, the primary vulnerability surface shifts from final outputs to intermediate execution traces. While safety guardrails are well-benchmarked for natural language responses, their efficacy remains largely unexplored within multi-step tool-use trajectories. To address this gap, we introduce TraceSafe-Bench, the first comprehensive benchmark specifically designed to assess mid-trajectory safety. It encompasses 12 risk categories, ranging from security threats (e.g., prompt injection, privacy leaks) to operational failures (e.g., hallucinations, interface inconsistencies), featuring over 1,000 unique execution instances. Our evaluation of 13 LLM-as-a-guard models and 7 specialized guardrails yields three critical findings: 1) Structural Bottleneck: Guardrail efficacy is driven more by structural data competence (e.g., JSON parsing) than semantic safety alignment. Performance correlates strongly with structured-to-text benchmarks ($\rho = 0.79$) but shows near-zero correlation with standard jailbreak robustness. 2) Architecture over Scale: Model architecture influences risk detection performance more significantly than model size, with general-purpose LLMs consistently outperforming specialized safety guardrails in trajectory analysis. 3) Temporal Stability: Accuracy remains resilient across extended trajectories. Increased execution steps allow models to pivot from static tool definitions to dynamic execution behaviors, actually improving risk detection performance in later stages. Our findings suggest that securing agentic workflows requires jointly optimizing for structural reasoning and safety alignment to effectively mitigate mid-trajectory risks.
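The correlation reported in the first finding can be illustrated with a small sketch. It assumes that $\rho$ denotes Spearman's rank correlation, which the abstract does not state, and every score below is invented for illustration; none of the numbers come from the paper. The function names `average_ranks` and `spearman_rho` are hypothetical helpers, not part of TraceSafe-Bench.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one sequence is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical per-model scores (made up, NOT results from the paper).
structured_to_text_score = [0.42, 0.55, 0.61, 0.70, 0.78]
trajectory_guard_accuracy = [0.50, 0.48, 0.66, 0.71, 0.80]

rho = spearman_rho(structured_to_text_score, trajectory_guard_accuracy)
print(f"Spearman rho = {rho:.2f}")
```

Because the statistic depends only on rank order, a high value would mean that models ranking well on structured-data tasks also tend to rank well as trajectory guards, regardless of how the raw scores are scaled.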

