2606.16802v1 Jun 15, 2026 cs.AI

LabOSBench: Benchmarking Computer Use Agents for Scientific Instrument Control

Zhaoyang Liu

Ben Fei

Han Deng

Anqi Zou

Wanli Ouyang

Chengyun Zhang

Junquan Hu

Yu Wang

Yuxiang Xing

Aokai Zhang

Hanling Zhang

Zhihui Wang

Current computer-use benchmarks focus primarily on software operation tasks in virtualized systems, whereas scientific instrumentation requires coordinated control over complex interfaces and feedback-driven parameter adjustment. Directly evaluating agents on physical high-precision instruments, however, is impractical because of high cost, safety risks, limited accessibility, and the difficulty of ensuring reproducible evaluation. This motivates a simulated yet realistic testbed that preserves the operational challenges of scientific instruments while enabling scalable and safe benchmarking. To this end, we introduce LabOSBench, a challenging benchmark for multimodal GUI agents built on a suite of web-based scientific-instrument simulators. Because it runs directly in a browser, LabOSBench avoids resource-heavy OS virtualization while supporting flexible task configuration and execution-based evaluation. Specifically, LabOSBench comprises 96 subtasks across eight instrument simulators, covering workflows that span sample loading, alignment, parameter tuning, data acquisition, and result inspection. We evaluate general-purpose vision-language models, specialized GUI agent models, and advanced agentic frameworks at both the subtask and end-to-end levels. Our experiments reveal that although existing agents can complete many structured GUI subtasks, they still struggle with feedback-driven operations and long-horizon workflow execution. Overall, LabOSBench provides a reproducible, low-cost testbed for advancing computer-use agents toward scientific-instrument control.
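To make the idea of execution-based evaluation at the subtask and end-to-end levels concrete, the following is a minimal sketch, not the benchmark's actual implementation. It assumes that a simulator exposes its final state as named numeric readings and that each subtask is judged by a tolerance check on that state. The names `Subtask`, `within`, `evaluate`, and the state keys are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: simulator state is a flat mapping of reading names to numbers.
State = Dict[str, float]


@dataclass
class Subtask:
    name: str
    check: Callable[[State], bool]  # predicate on the final simulator state


def within(key: str, target: float, tol: float) -> Callable[[State], bool]:
    """Build a check that passes when state[key] is within tol of target."""
    def _check(state: State) -> bool:
        return key in state and abs(state[key] - target) <= tol
    return _check


def evaluate(subtasks: List[Subtask], final_states: List[State]) -> Dict[str, float]:
    """Score each subtask on its final state, then aggregate.

    The end-to-end score is 1.0 only if every subtask check passes.
    """
    results = [t.check(s) for t, s in zip(subtasks, final_states)]
    return {
        "subtask_success_rate": sum(results) / len(results),
        "end_to_end_success": float(all(results)),
    }


# Hypothetical two-step workflow: set a wavelength, then adjust focus.
subtasks = [
    Subtask("set_wavelength", within("wavelength_nm", 532.0, 0.5)),
    Subtask("adjust_focus", within("focus_um", 10.0, 0.1)),
]
final_states = [{"wavelength_nm": 532.2}, {"focus_um": 10.4}]
scores = evaluate(subtasks, final_states)
```

In this sketch the first check passes and the second fails, so the agent earns partial credit at the subtask level but no credit end to end, which mirrors the gap the abstract reports between structured subtasks and long-horizon workflows.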
