2606.11063v1 Jun 09, 2026 cs.AI

CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs

Thomas Jiralerspong
Y. Bengio
Alexander Panfilov
Max Planck Institute for Intelligent Systems
Roland S. Zimmermann
Jonas Geiping
Joachim Schaeffer
Guillaume Lajoie

AI control protocols oversee untrusted models by monitoring their actions and modifying potentially unsafe steps, often using a trusted model. Such interventions alter part of the untrusted model's trajectory. If the untrusted model detects an intervention, it may infer properties of the monitor and adapt to evade control. We introduce CIAware-Bench, a benchmark for measuring control intervention (CI) awareness across frontier models. CIAware-Bench tests whether models can distinguish their own trajectories from those modified by a control intervention. The benchmark spans four task domains (essay writing, BigCodeBench, Bash Arena, and SHADE-Arena) and varies trajectory watermarking, side-task presence, and the control protocol. Evaluating eleven frontier models, we find low to moderate CI awareness under default settings (accuracy up to 0.87, where chance on balanced binary classification is 0.5), with substantial variation across task domains and model pairs. Detection is generally easier across model families, suggesting that models exploit provider-specific differences in style or post-training. Overall, CI awareness is not a fixed model-level property and should be measured for each new model release and deployment scenario. We release CIAware-Bench to track CI awareness and to inform the design of control protocols whose interventions are harder to detect.
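To make the 0.5 chance baseline concrete, here is a minimal sketch of how a detection score of this kind could be computed. It is an illustration only: the abstract does not specify the benchmark's scoring code, and the function name `balanced_accuracy`, the boolean label convention, and the toy data are all assumptions.

```python
from typing import Sequence


def balanced_accuracy(labels: Sequence[bool], predictions: Sequence[bool]) -> float:
    """Mean of per-class recall for a binary detection task.

    Assumed convention (not from the paper): True means the trajectory
    was modified by a control intervention, False means it is the
    model's own unmodified trajectory.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    # Split predictions by the true class of each trajectory.
    on_modified = [p for l, p in zip(labels, predictions) if l]
    on_original = [p for l, p in zip(labels, predictions) if not l]
    if not on_modified or not on_original:
        raise ValueError("both classes must be present")

    # Recall on modified trajectories (true positive rate).
    tpr = sum(on_modified) / len(on_modified)
    # Recall on unmodified trajectories (true negative rate).
    tnr = sum(not p for p in on_original) / len(on_original)
    return (tpr + tnr) / 2


# Hypothetical toy data: two modified and two unmodified trajectories.
labels = [True, True, False, False]
always_flag = [True, True, True, True]    # a detector that always says "modified"
perfect = [True, True, False, False]      # a detector that is always right
```

Under this scoring, a detector that ignores its input (for example, one that always answers "modified") lands at 0.5, which is why values near 0.5 indicate little CI awareness and values approaching 1.0 indicate reliable detection.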
