2604.21375v1 Apr 23, 2026 cs.CL

VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

Zeyu Zheng (citations: 250, h-index: 6)
Cihang Xie (citations: 156, h-index: 6)
Huaxiu Yao (citations: 218, h-index: 8)
Yiyang Zhou (citations: 1,487, h-index: 15)
Caiming Xiong (citations: 3,801, h-index: 27)
Yuyin Zhou (citations: 789, h-index: 14)
Haoqin Tu (citations: 1,013, h-index: 14)
Zijun Wang (citations: 324, h-index: 8)
Q. Han (citations: 78, h-index: 5)
Haoyu Dai (citations: 0, h-index: 0)
Nancy Lau (citations: 20, h-index: 3)
Alvaro A. Cárdenas (citations: 7, h-index: 1)
Yuhui Xu (citations: 1, h-index: 1)
Ran Xu (citations: 119, h-index: 4)

Original Abstract

Autonomous GUI agents face two fundamental challenges: early stopping, where agents prematurely declare success without verifiable evidence, and repetitive loops, where agents cycle through the same failing actions without recovery. We present VLAA-GUI, a modular GUI agentic framework built around three integrated components that guide the system on when to Stop, Recover, and Search. First, a mandatory Completeness Verifier enforces UI-observable success criteria and verification at every finish step, using an agent-level verifier that cross-examines completion claims against decision rules and rejects those lacking direct visual evidence. Second, a mandatory Loop Breaker provides multi-tier filtering: switching interaction mode after repeated failures, forcing strategy changes after persistent screen-state recurrence, and binding reflection signals to strategy shifts. Third, an on-demand Search Agent searches online for unfamiliar workflows by directly querying a capable LLM with search ability, returning results as plain text. We additionally integrate a Coding Agent for code-intensive actions and a Grounding Agent for precise action grounding, both invoked on demand. We evaluate VLAA-GUI across five top-tier backbones, including Opus 4.5, Opus 4.6, and Gemini 3.1 Pro, on two benchmarks with Linux and Windows tasks, achieving top performance on both (77.5% on OSWorld and 61.0% on WindowsAgentArena). Notably, three of the five backbones surpass human performance (72.4%) on OSWorld in a single pass. Ablation studies show that all three proposed components consistently improve a strong backbone, while a weaker backbone benefits more from these tools when the step budget is sufficient. Further analysis also shows that the Loop Breaker nearly halves wasted steps for loop-prone models.
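
The abstract describes the Loop Breaker and the Completeness Verifier only at a high level. The sketch below is one possible reading of those two mechanisms, not the authors' implementation: the class and function names, the thresholds, the screen-hash representation, and the directive strings are all hypothetical and chosen purely for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class LoopBreaker:
    """Hypothetical three-tier loop detector (thresholds are assumed values)."""
    max_action_failures: int = 3   # tier 1: repeated failures of one action
    max_state_repeats: int = 3     # tier 2: recurrence of one screen state
    failures: Counter = field(default_factory=Counter)
    state_visits: Counter = field(default_factory=Counter)

    def observe(self, action: str, succeeded: bool, screen_hash: str,
                reflection_flags_stuck: bool = False) -> Optional[str]:
        """Record one step; return a directive, or None to continue as-is."""
        self.state_visits[screen_hash] += 1
        if not succeeded:
            self.failures[action] += 1

        # Tier 3: a reflection signal is bound directly to a strategy shift.
        if reflection_flags_stuck:
            return "change_strategy"
        # Tier 2: the same screen state keeps coming back.
        if self.state_visits[screen_hash] >= self.max_state_repeats:
            self.state_visits[screen_hash] = 0  # reset so it does not refire at once
            return "change_strategy"
        # Tier 1: the same action keeps failing (e.g. switch mouse -> keyboard).
        if self.failures[action] >= self.max_action_failures:
            self.failures[action] = 0
            return "switch_interaction_mode"
        return None


def verify_completion(criteria: Dict[str, bool]) -> bool:
    """Accept a 'done' claim only if every UI-observable criterion is backed
    by direct visual evidence; an empty criteria set is rejected."""
    return bool(criteria) and all(criteria.values())


if __name__ == "__main__":
    breaker = LoopBreaker()
    for screen in ("s1", "s2", "s3"):
        directive = breaker.observe("click(10, 20)", False, screen)
    print(directive)
    print(verify_completion({"file saved": True, "dialog closed": False}))
```

In this reading, the agent loop would call `observe` after every step and `verify_completion` whenever the model proposes to finish, rejecting the finish action if any criterion lacks visual confirmation.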

Citations: 1; Influential: 0; Altmetric: 13.5; Score: 68.5
