2605.29601v1 May 28, 2026 cs.CL

Training Deliberative Monitors for Black-Box Scheming Detection

Akshat D. Naik
Simon Storf
Rich Barton-Cooper
Aditya Sinha
Victor Gillioz
Kilian Merkelbach
Axel Højmark
Marius Hobbhahn

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable, or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets and evaluate across six out-of-distribution agentic misalignment benchmarks. Applying our method to Qwen3.5-27B yields a monitor that outperforms every low-cost frontier model used as a prompted monitor (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5), as well as Gemini 2.5 Pro, at lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance, but at roughly $16$--$34\times$ higher marginal inference cost. Several of our trained monitors lie on the empirical cost--performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
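
To make the notion of an empirical cost--performance Pareto frontier concrete, the following minimal sketch selects the monitors that no other monitor beats on both cost and performance. The `Monitor` type, the `pareto_frontier` function, the monitor names, and every number are hypothetical placeholders chosen for illustration; none are results or code from the paper.

```python
from typing import List, NamedTuple


class Monitor(NamedTuple):
    name: str
    cost: float         # marginal inference cost, USD per 1,000 evaluations (lower is better)
    performance: float  # detection metric on some fixed benchmark (higher is better)


def pareto_frontier(monitors: List[Monitor]) -> List[Monitor]:
    """Return the monitors that are not dominated on (cost, performance).

    A monitor is dominated if another one is at least as cheap and at least
    as accurate, and strictly better on one of the two. Exact duplicates are
    kept only once.
    """
    frontier: List[Monitor] = []
    best_performance = float("-inf")
    # Scan from cheapest to most expensive; at equal cost, best performance first.
    for m in sorted(monitors, key=lambda m: (m.cost, -m.performance)):
        # A monitor survives only if it beats every cheaper-or-equal monitor.
        if m.performance > best_performance:
            frontier.append(m)
            best_performance = m.performance
    return frontier


# Hypothetical placeholder values, NOT figures from the paper.
example = [
    Monitor("monitor-a", cost=1.0, performance=0.70),
    Monitor("monitor-b", cost=2.0, performance=0.65),   # costlier and worse than monitor-a
    Monitor("monitor-c", cost=5.0, performance=0.85),
    Monitor("monitor-d", cost=30.0, performance=0.95),
    Monitor("monitor-e", cost=40.0, performance=0.90),  # costlier and worse than monitor-d
]

frontier_names = [m.name for m in pareto_frontier(example)]
print(frontier_names)
```

Under these placeholder values, `monitor-b` and `monitor-e` drop out because a cheaper monitor performs at least as well. A frontier computed this way is relative to the set of monitors compared, which matches the abstract's qualifier "among the monitors we evaluate".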
