2606.16802v1 Jun 15, 2026 cs.AI

LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

Zhaoyang Liu
Ben Fei
Han Deng
Anqi Zou
Wanli Ouyang
Chengyun Zhang
Junquan Hu
Yu Wang
Yuxiang Xing
Aokai Zhang
Hanling Zhang
Zhihui Wang

Current computer-use benchmarks focus primarily on software operation tasks in virtualized systems, whereas scientific instrumentation requires coordinated control over complex interfaces and feedback-driven parameter adjustment. Evaluating agents directly on physical high-precision instruments is impractical because of high cost, safety risks, limited accessibility, and the difficulty of ensuring reproducible evaluation. This motivates a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Because it runs directly in a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. LabOSBench comprises 96 subtasks across eight instrument simulators, covering the workflow from sample loading, alignment, parameter tuning, and data acquisition through result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both the subtask and end-to-end levels. Our experiments show that although existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-use agents toward scientific-instrument control.
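
The abstract does not say how execution-based evaluation is implemented, so the following is only a minimal sketch of the general idea, under stated assumptions: each subtask is scored by inspecting the simulator state the agent leaves behind (not its action trace), and an end-to-end run counts as successful only if every subtask check passes. All names here (`Subtask`, `within`, `evaluate`, the state keys, targets, and tolerances) are hypothetical and are not taken from LabOSBench.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical simulator state: a flat mapping of parameter names to values.
State = Dict[str, float]


@dataclass
class Subtask:
    name: str
    check: Callable[[State], bool]  # inspects the resulting state, not the action trace


def within(key: str, target: float, tol: float) -> Callable[[State], bool]:
    """Build a checker that passes when state[key] is within tol of target."""
    def _check(state: State) -> bool:
        return key in state and abs(state[key] - target) <= tol
    return _check


def evaluate(subtasks: List[Subtask], states: List[State]) -> Dict[str, float]:
    """Score one workflow: per-subtask success rate and all-or-nothing end-to-end success."""
    results = [task.check(state) for task, state in zip(subtasks, states)]
    return {
        "subtask_success_rate": sum(results) / len(results),
        "end_to_end_success": float(all(results)),
    }


# Illustrative workflow with made-up parameters and tolerances.
workflow = [
    Subtask("load_sample", within("sample_loaded", 1.0, 0.0)),
    Subtask("focus", within("focus_um", 12.5, 0.5)),
    Subtask("set_exposure", within("exposure_ms", 200.0, 10.0)),
]
# State observed after each subtask; the last one misses the exposure target.
states = [
    {"sample_loaded": 1.0},
    {"sample_loaded": 1.0, "focus_um": 12.8},
    {"sample_loaded": 1.0, "focus_um": 12.8, "exposure_ms": 150.0},
]
scores = evaluate(workflow, states)
print(scores)
```

In this toy run two of three subtask checks pass while the end-to-end score is zero, which mirrors the gap the abstract describes between subtask completion and long-horizon workflow execution.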
