2604.27776v1 Apr 30, 2026 cs.AI

WindowsWorld: A Process-Centric Benchmark of Autonomous GUI Agents in Professional Cross-Application Environments

Baotian Hu, Jinchao Li, Yunxin Li (Harbin Institute of Technology, Shenzhen), Chen Zhao, Zhenran Xu, Min Zhang

Abstract

GUI agents have shown impressive capabilities on common computer-use benchmarks such as OSWorld, but current benchmarks mainly focus on isolated, single-application tasks. This overlooks a critical real-world requirement: coordinating across multiple applications to accomplish complex, profession-specific workflows. To bridge this gap, we present WindowsWorld, a computer-use benchmark of cross-application workflows designed to systematically assess GUI agents on complex multi-step tasks that mirror real-world professional activities. Our methodology uses a multi-agent framework steered by 16 occupations to generate tasks at four difficulty levels with intermediate inspection; the tasks are then refined by human review and executed in a simulated environment. The resulting benchmark contains 181 tasks with an average of 5.0 sub-goals across 17 common desktop applications, of which 78% are inherently multi-application. Experiments with leading large models and agents show that: 1) all computer-use agents perform poorly on multi-application tasks (< 21% success rate), far below their performance on simple single-application tasks; 2) they largely fail at tasks that require conditional judgment and reasoning across three or more applications, stalling at early sub-goals; 3) execution efficiency is low: tasks often fail even after far exceeding human step limits. Code, benchmark data, and evaluation resources are available at github.com/HITsz-TMG/WindowsWorld.
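The abstract describes tasks made of sub-goals with intermediate inspection, and reports success rates separately for single- and multi-application tasks. The sketch below illustrates how such process-centric scoring could be computed. It is not the benchmark's actual evaluation code: the TaskResult record, its field names, the demo data, and the assumption that sub-goals are checked in a fixed order are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TaskResult:
    """Hypothetical record of one agent run; not the benchmark's real schema."""
    task_id: str
    num_apps: int        # number of applications the task involves
    checks: List[bool]   # outcome of each intermediate sub-goal check, in order


def succeeded(r: TaskResult) -> bool:
    # Assumption: a task counts as solved only if every sub-goal check passes.
    return bool(r.checks) and all(r.checks)


def progress(r: TaskResult) -> float:
    # Fraction of sub-goal checks passed (partial credit for the process).
    return sum(r.checks) / len(r.checks) if r.checks else 0.0


def stall_index(r: TaskResult) -> Optional[int]:
    # Index of the first failed sub-goal, or None if every check passed.
    for i, ok in enumerate(r.checks):
        if not ok:
            return i
    return None


def success_rate(results: List[TaskResult]) -> float:
    return sum(succeeded(r) for r in results) / len(results) if results else 0.0


# Made-up demo runs, for illustration only.
results = [
    TaskResult("single-app-demo", 1, [True, True, True]),
    TaskResult("two-app-demo", 2, [True, False, False, False]),
    TaskResult("three-app-demo", 3, [False, False, False, False, False]),
]
multi = [r for r in results if r.num_apps > 1]

print("overall success rate:", success_rate(results))
print("multi-app success rate:", success_rate(multi))
print("per-task progress:", [progress(r) for r in results])
print("first failed sub-goal:", [stall_index(r) for r in results])
```

Keeping the per-sub-goal outcomes, rather than a single pass/fail flag, is what makes it possible to see where a run stalls, which is the kind of observation the abstract reports for tasks spanning three or more applications.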

