2603.17441v1 Mar 18, 2026 cs.CV

AdaZoom-GUI: Adaptive Zoom-based GUI Grounding with Instruction Refinement

Siqi Pei (Citations: 25, h-index: 2)
Liang Tang (Citations: 12, h-index: 2)
Tiaonan Duan (Citations: 8, h-index: 1)
Long Chen (Citations: 73, h-index: 3)
Kaer Huang (Citations: 9, h-index: 1)
Yiqiang Yan (Citations: 11, h-index: 2)
Chen Jiang (Citations: 5, h-index: 1)
Borui Zhang, Tsinghua University (Citations: 508, h-index: 8)
Jiwen Lu (Citations: 264, h-index: 4)
Shuxian Li (Citations: 9, h-index: 1)
Yanzhe Jing (Citations: 8, h-index: 1)
Bo Zhang (Citations: 0, h-index: 0)

Original Abstract

GUI grounding is a critical capability for vision-language models (VLMs) that enables automated interaction with graphical user interfaces by locating target elements from natural language instructions. However, grounding on GUI screenshots remains challenging due to high-resolution images, small UI elements, and ambiguous user instructions. In this work, we propose AdaZoom-GUI, an adaptive zoom-based GUI grounding framework that improves both localization accuracy and instruction understanding. Our approach introduces an instruction refinement module that rewrites natural language commands into explicit and detailed descriptions, allowing the grounding model to focus on precise element localization. In addition, we design a conditional zoom-in strategy that selectively performs a second-stage inference on predicted small elements, improving localization accuracy while avoiding unnecessary computation and context loss on simpler cases. To support this framework, we construct a high-quality GUI grounding dataset and train the grounding model using Group Relative Policy Optimization (GRPO), enabling the model to predict both click coordinates and element bounding boxes. Experiments on public benchmarks demonstrate that our method achieves state-of-the-art performance among models with comparable or even larger parameter sizes, highlighting its effectiveness for high-resolution GUI understanding and practical GUI agent deployment.
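The conditional zoom-in strategy described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how "small" is decided or how the crop is chosen, so the area-ratio threshold `small_area_ratio`, the fixed `crop_size`, the `Prediction` type, and the `ground` callable (a stand-in for the grounding model that returns a click point and bounding box in the coordinates of the region it was given) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Point = Tuple[float, float]


@dataclass
class Prediction:
    click: Point  # predicted click coordinates
    box: Box      # predicted element bounding box


# Hypothetical model interface: given a screen region and an instruction,
# return a prediction in coordinates local to that region.
GroundFn = Callable[[Box, str], Prediction]


def crop_window(center: Point, screen_size: Tuple[int, int],
                crop_size: Tuple[int, int]) -> Box:
    """Crop box of crop_size centred on `center`, clamped to the screen."""
    w, h = screen_size
    cw, ch = min(crop_size[0], w), min(crop_size[1], h)
    x1 = min(max(center[0] - cw / 2, 0), w - cw)
    y1 = min(max(center[1] - ch / 2, 0), h - ch)
    return (x1, y1, x1 + cw, y1 + ch)


def ground_with_conditional_zoom(
    ground: GroundFn,
    instruction: str,
    screen_size: Tuple[int, int],
    small_area_ratio: float = 0.001,        # assumed threshold, not from the paper
    crop_size: Tuple[int, int] = (640, 640),  # assumed crop size, not from the paper
) -> Prediction:
    w, h = screen_size
    # Stage 1: ground on the full screenshot.
    first = ground((0, 0, w, h), instruction)
    x1, y1, x2, y2 = first.box
    area_ratio = max(x2 - x1, 0) * max(y2 - y1, 0) / (w * h)

    # Large enough element: keep the single-pass result, which avoids
    # extra computation and keeps the full-screen context.
    if area_ratio >= small_area_ratio:
        return first

    # Stage 2: small element, so re-run grounding on a crop around the
    # first click and map the result back to full-screen coordinates.
    region = crop_window(first.click, screen_size, crop_size)
    second = ground(region, instruction)
    ox, oy = region[0], region[1]
    return Prediction(
        click=(second.click[0] + ox, second.click[1] + oy),
        box=(second.box[0] + ox, second.box[1] + oy,
             second.box[2] + ox, second.box[3] + oy),
    )
```

In this sketch the predicted bounding box is what makes the zoom decision possible: a click point alone says nothing about element size, which is presumably one reason the model is trained to output both.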

1 Citation

0 Influential

4 Altmetric

21.0 Score
