2606.11078v1 Jun 09, 2026 cs.AI

A History-Aware Visually Grounded Critic for Computer Use Agents

Elias Stengel-Eskin
Mohit Bansal
Archiki Prasad
UNC Chapel Hill
Sambit Sahu
J. Chen
Supriyo Chakraborty
Zaid Khan
Kartik Balasubramaniam
Jaewoo Lee
Hyunji Lee

Various test-time interventions for Computer Use Agents (CUAs), including critic models, have been developed to improve performance through pre-execution action evaluation in complex Graphical User Interface (GUI) environments. However, existing critics suffer from two key limitations: they (1) focus primarily on short-sighted decision loops (e.g., forgetting earlier actions) and (2) lack the visual grounding needed to detect flawed actions (e.g., clicking wrong UI elements). To address these limitations, we introduce HiViG, a History-aware Visually Grounded test-time framework, built around a multimodal critic trained on real GUI trajectories to abstract past interactions into a compact record and to evaluate actions with visual grounding. At test time, HiViG integrates the critic into the policy decision loop to provide macro-action history, which summarizes the policy's completed achievements, and visually grounded critique, which verifies raw execution coordinates against the current screenshot to intercept errors before execution. Across web, mobile, and desktop benchmarks, HiViG consistently outperforms existing scalar and verbal critics, improving average success rates over the strongest baseline by 5.8% for Qwen3-VL-32B and 9.0% for Gemini-3-Flash, and demonstrates strong cross-platform generalization. Ablations show that macro-action history mitigates short-sighted planning and that visually grounded critique reduces execution errors, and that both components are critical for test-time scaling in long-horizon GUI tasks.
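
To make the decision loop described in the abstract concrete, the following is a minimal Python sketch of one critic-guided step. It is not the authors' implementation: every name here (`Action`, `Critique`, `MacroHistory`, `toy_grounded_critic`, `critic_guided_step`, `demo_policy`) is hypothetical, the multimodal critic is replaced by a simple bounding-box check over a dictionary of element boxes standing in for the screenshot, and the retry budget and history format are assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Action:
    kind: str    # e.g. "click"
    x: int       # raw execution coordinates proposed by the policy
    y: int
    target: str  # UI element the policy says it intends to hit


@dataclass
class Critique:
    accept: bool
    feedback: str


@dataclass
class MacroHistory:
    """Compact record of completed achievements (assumed format: short strings)."""
    achievements: List[str] = field(default_factory=list)

    def summarize(self) -> str:
        return "; ".join(self.achievements) or "(nothing completed yet)"


def toy_grounded_critic(action: Action, elements: Dict[str, BBox]) -> Critique:
    """Stand-in for the learned critic: check coordinates against element boxes."""
    box = elements.get(action.target)
    if box is None:
        return Critique(False, f"'{action.target}' is not visible on screen")
    left, top, right, bottom = box
    if left <= action.x <= right and top <= action.y <= bottom:
        return Critique(True, "coordinates land on the intended element")
    return Critique(False, f"({action.x}, {action.y}) misses '{action.target}' at {box}")


def critic_guided_step(
    propose: Callable[[str, str], Action],
    elements: Dict[str, BBox],
    history: MacroHistory,
    max_retries: int = 2,  # assumed retry budget, not taken from the paper
) -> Optional[Action]:
    """Ask the policy for an action and intercept flawed ones before execution."""
    feedback = ""
    for _ in range(max_retries + 1):
        action = propose(history.summarize(), feedback)
        critique = toy_grounded_critic(action, elements)
        if critique.accept:
            # Simplification: record the step as an achievement once accepted.
            history.achievements.append(f"{action.kind} {action.target}")
            return action
        feedback = critique.feedback  # verbal critique returned to the policy
    return None  # no acceptable action within the retry budget


def demo_policy(history_summary: str, feedback: str) -> Action:
    """Hypothetical policy: misses on the first try, corrects itself after critique."""
    if feedback:
        return Action("click", 120, 45, "Search button")
    return Action("click", 400, 300, "Search button")


elements = {"Search button": (100, 30, 160, 60)}
history = MacroHistory()
chosen = critic_guided_step(demo_policy, elements, history)
```

In this toy run the first proposed click falls outside the button's box, the critique is passed back to the policy, and the corrected click is accepted and recorded in the history. In the framework the abstract describes, both the critique and the history summary come from a trained multimodal critic operating on the actual screenshot rather than from a lookup table.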
