2601.09770v1 Jan 14, 2026 cs.AI

GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

Dakuan Lu (Citations: 1, h-index: 1)
Xiangcheng Liu (Citations: 8, h-index: 1)
Hantao Yao (Citations: 1, h-index: 1)
Chen Chen (Citations: 42, h-index: 3)
Haoyi Hu (Citations: 72, h-index: 2)
Wu Liu (Citations: 49, h-index: 4)
Jiawei Shao (Citations: 520, h-index: 8)

Abstract

Recent advances in vision-language models (VLMs) and reinforcement learning (RL) have driven progress in GUI automation. However, most existing methods rely on static, one-shot visual inputs and passive perception, lacking the ability to adaptively determine when, whether, and how to observe the interface. We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. To acquire more informative observations, the agent learns to make strategic decisions on both whether and how to invoke visual tools, such as cropping or zooming, within a two-stage reasoning process. To support this behavior, we introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding, coordinated by a two-level policy. In addition, we design a spatially continuous reward function tailored to tool usage, which integrates both location proximity and region overlap to provide dense supervision and alleviate the reward sparsity common in GUI environments. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples, significantly outperforming both supervised and RL-based baselines. These results highlight that tool-aware active perception, enabled by staged policy reasoning and fine-grained reward feedback, is critical for building robust and data-efficient GUI agents.
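The abstract names two mechanisms without giving formulas: a crop/zoom tool call followed by fine-grained grounding, and a reward that blends location proximity with region overlap. The Python sketch below illustrates one plausible reading of both. Everything specific in it is an assumption for illustration and is not taken from the paper: the function names, the Gaussian form of the proximity term, the `sigma` and `alpha` values, and the use of plain IoU as the overlap term.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in screenshot pixels
Point = Tuple[float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def proximity(point: Point, target: Box, sigma: float) -> float:
    """Gaussian falloff of the distance from `point` to the target center.

    Assumed form: the abstract only says "location proximity".
    """
    cx = (target[0] + target[2]) / 2
    cy = (target[1] + target[3]) / 2
    d2 = (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return math.exp(-d2 / (2 * sigma ** 2))


def tool_reward(crop: Box, target: Box,
                sigma: float = 100.0, alpha: float = 0.5) -> float:
    """Hypothetical dense reward for a crop call, in [0, 1].

    Blends crop-center proximity with crop/target overlap. The proximity
    term stays positive even when the crop misses the target entirely,
    which is what keeps the signal from being sparse.
    """
    center = ((crop[0] + crop[2]) / 2, (crop[1] + crop[3]) / 2)
    return alpha * proximity(center, target, sigma) + (1 - alpha) * iou(crop, target)


def crop_to_screen(point: Point, crop: Box, zoom: float) -> Point:
    """Map a click predicted on a zoomed crop back to full-screenshot pixels."""
    return (crop[0] + point[0] / zoom, crop[1] + point[1] / zoom)


if __name__ == "__main__":
    target = (100.0, 100.0, 140.0, 120.0)   # ground-truth element box
    near = (80.0, 90.0, 180.0, 150.0)       # coarse crop that contains the target
    far = (500.0, 500.0, 600.0, 560.0)      # coarse crop that misses it
    print(tool_reward(near, target), tool_reward(far, target))
    # Stage two: a click at (50, 20) on the 2x-zoomed `near` crop
    print(crop_to_screen((50.0, 20.0), near, zoom=2.0))
```

With plain IoU, a crop much larger than the target scores low on overlap even when it fully contains the target, so an actual implementation might prefer a containment-style overlap term; the sketch keeps IoU only for simplicity.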

Citations: 1, Influential: 0, Altmetric: 4, Score: 21.0
