2605.06455v1 May 07, 2026 cs.AI

PrefixGuard: A Real-Time Failure-Warning Monitoring System Built from LLM-Agent Trace Data

PrefixGuard: From LLM-Agent Traces to Online Failure-Warning Monitors

Jinwei Hu

Citations: 465

h-index: 8

Xinmiao Huang

Citations: 13

h-index: 2

Xiaowei Huang

Citations: 239

h-index: 7

Yihong Dong

Peking University

Citations: 2,148

h-index: 20

Rajarshi Roy

Citations: 2

h-index: 1

Changshun Wu

Citations: 148

h-index: 4

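Large language model (LLM) agents now carry out long tasks that use many tools, and the final outcome check can arrive too late for intervention. Online warning therefore needs lightweight monitors over heterogeneous trace prefixes, but hand-written event schemas are hard to maintain and LLM judging at deployment time is expensive. This paper proposes PrefixGuard, a trace-to-monitor framework that first runs an offline StepView induction step and then trains a monitor with supervised learning. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and a prefix-risk scorer from terminal outcomes. On WebArena, $τ^2$-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach AUPRC of 0.900, 0.710, 0.533, and 0.557, respectively. With the strongest backend for each representation, PrefixGuard improves on raw-text controls by +0.137 AUPRC on average. LLM judges remain substantially weaker under the same prefix-warning protocol. The authors also derive an observability ceiling on score-based AUPRC (area under the precision-recall curve) that separates monitor error from failures for which the observed prefix contains no evidence. In the finite-state audit, post-hoc DFA (deterministic finite automaton) extraction stays compact on WebArena and $τ^2$-Bench (29 and 20 states) but grows to 151 and 187 states on SkillsBench and TerminalBench. First-alert diagnostics further show that strong ranking does not imply deployment utility: WebArena ranks well but cannot support low-false-alarm alerts, whereas $τ^2$-Bench and TerminalBench retain more actionable early alerts. Taken together, the results position PrefixGuard as a practical recipe for synthesizing monitors, with explicit diagnostics for when prefix warnings translate into actionable interventions.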
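Since every headline number in this summary is an AUPRC, a small sketch of how such a score can be computed may help. The function below is an illustration only, not the paper's evaluation code: it uses the common step-wise estimate (average precision), assumes one risk score per run with label 1 for a failed run, and breaks ties by input order. The name `average_precision` and the toy data are hypothetical.

```python
from typing import Sequence


def average_precision(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Step-wise AUPRC estimate: mean precision at the rank of each failing run.

    Assumption (not from the paper): labels are 1 for failed runs and 0 for
    successful runs, and each run has been reduced to a single risk score.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one failing run")
    # Rank runs from highest to lowest risk; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` runs
    return total / n_pos


# Toy data: two failing runs ranked 1st and 3rd out of four.
toy_scores = [0.9, 0.8, 0.3, 0.1]
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_scores, toy_labels)
```

A random scorer has an expected AUPRC close to the failure rate of the dataset, so values such as 0.533 or 0.557 are best read against each benchmark's base rate, which this summary does not report.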

Original Abstract

Large language model (LLM) agents now execute long, tool-using tasks where final outcome checks can arrive too late for intervention. Online warning requires lightweight prefix monitors over heterogeneous traces, but hand-authored event schemas are brittle and deployment-time LLM judging is costly. We introduce PrefixGuard, a trace-to-monitor framework with an offline StepView induction step followed by supervised monitor training. StepView induces deterministic typed-step adapters from raw trace samples, and the monitor learns an event abstraction and prefix-risk scorer from terminal outcomes. Across WebArena, $τ^2$-Bench, SkillsBench, and TerminalBench, the strongest PrefixGuard monitors reach 0.900/0.710/0.533/0.557 AUPRC. Using the strongest backend within each representation, they improve over raw-text controls by an average of +0.137 AUPRC. LLM judges remain substantially weaker under the same prefix-warning protocol. We also derive an observability ceiling on score-based area under the precision-recall curve (AUPRC) that separates monitor error from failures lacking evidence in the observed prefix. For finite-state audit, post-hoc deterministic finite automaton (DFA) extraction remains compact on WebArena and $τ^2$-Bench (29 and 20 states) but expands to 151 and 187 states on SkillsBench and TerminalBench. Finally, first-alert diagnostics show that strong ranking does not imply deployment utility: WebArena ranks well yet fails to support low-false-alarm alerts, whereas $τ^2$-Bench and TerminalBench retain more actionable early alerts. Together, these results position PrefixGuard as a practical monitor-synthesis recipe with explicit diagnostics for when prefix warnings translate into actionable interventions.
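The first-alert point can be made concrete with a minimal sketch. It assumes each run is a list of per-step risk scores, that an alert fires the first time a score strictly exceeds a threshold, and that the threshold is chosen so at most a given fraction of successful runs ever alert. The abstract does not spell out the paper's protocol, so all function names, the strict comparison, and the toy traces here are assumptions for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def first_alert_step(risks: Sequence[float], threshold: float) -> Optional[int]:
    """Index of the first step whose risk strictly exceeds the threshold."""
    for step, risk in enumerate(risks):
        if risk > threshold:
            return step
    return None


def threshold_for_budget(success_traces: List[Sequence[float]], budget: float) -> float:
    """Lowest threshold (among observed peaks) that lets at most
    `budget` of the successful runs raise an alert."""
    peaks = sorted(max(trace) for trace in success_traces)
    allowed = int(budget * len(peaks))  # successful runs permitted to alert
    if allowed >= len(peaks):
        return float("-inf")
    # Every peak at or below this value stays silent under a strict comparison.
    return peaks[len(peaks) - allowed - 1]


def first_alert_report(
    fail_traces: List[Sequence[float]],
    success_traces: List[Sequence[float]],
    budget: float,
) -> Tuple[float, float, float]:
    """Return (threshold, recall on failing runs, mean steps of lead time)."""
    thr = threshold_for_budget(success_traces, budget)
    alerts = [first_alert_step(trace, thr) for trace in fail_traces]
    # Lead time: steps remaining after the alert until the run ends.
    leads = [len(t) - 1 - s for t, s in zip(fail_traces, alerts) if s is not None]
    recall = len(leads) / len(fail_traces) if fail_traces else 0.0
    mean_lead = sum(leads) / len(leads) if leads else 0.0
    return thr, recall, mean_lead


# Toy traces (hypothetical): one successful run peaks at 0.6.
ok_runs = [[0.1, 0.2], [0.1, 0.6]]
bad_runs = [[0.2, 0.7, 0.9], [0.1, 0.3, 0.5]]
strict = first_alert_report(bad_runs, ok_runs, budget=0.0)
loose = first_alert_report(bad_runs, ok_runs, budget=0.5)
```

In the toy data a single successful run with a high peak forces the zero-false-alarm threshold up to 0.6, and only one of the two failing runs is then caught. A monitor can therefore rank runs well overall and still support few alerts at a low false-alarm budget, which is the kind of gap the abstract describes for WebArena.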

0 Citations

0 Influential

10 Altmetric

50.0 Score


