2602.13653v1 Feb 14, 2026 cs.AI

Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

Weihua Luo

Kaifu Zhang

Yibo Wang

Guangda Huzhang

Yuwei Hu

Yu Xia

Shiyin Lu

Qing-Guo Chen

Zhao Xu

Lijun Zhang

Abstract

Recent advances in Multimodal Large Language Models (MLLMs) have substantially accelerated progress on autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, which leads to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents that consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values, each evaluating how much a given action contributes to task completion. The latter takes step-wise samples from state-action trajectories as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs are manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with strong GUI interaction capabilities, achieving remarkable performance on GUI navigation and grounding benchmarks and even surpassing larger-scale competitors.
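
The two components can be illustrated with a toy, tabular sketch. This is not the authors' implementation: the abstract does not say how the Q-model is trained or which reinforcement learning objective is used. The sketch assumes that Q-values are the average discounted task outcome per (state, action) pair and that the policy is a softmax over per-state logits, updated with a Q-based advantage. The names `ACTIONS`, `fit_q_table`, and `stepwise_update` are hypothetical. In the paper the trajectories are produced by the policy itself; here they are hard-coded for brevity.

```python
import math
from collections import defaultdict

# Hypothetical action set for a single toy GUI screen (an assumption).
ACTIONS = ["click_ok", "click_cancel"]


def fit_q_table(trajectories, gamma=0.9):
    """Toy stand-in for agentic-Q estimation.

    Each trajectory is (steps, success), where steps is a list of
    (state, action) pairs. The value of a step is assumed to be the
    discounted terminal outcome, averaged over all occurrences.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for steps, success in trajectories:
        horizon = len(steps)
        outcome = 1.0 if success else 0.0
        for t, (state, action) in enumerate(steps):
            sums[(state, action)] += (gamma ** (horizon - 1 - t)) * outcome
            counts[(state, action)] += 1
    return {key: sums[key] / counts[key] for key in sums}


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def stepwise_update(logits, step_samples, q_table, lr=0.5):
    """Toy stand-in for step-wise policy optimization.

    Consumes individual (state, action) samples and scores them with the
    Q-table only, so no environment interaction happens during the update.
    """
    for state, action in step_samples:
        probs = softmax(logits[state])
        q_values = [q_table.get((state, a), 0.0) for a in ACTIONS]
        baseline = sum(p * q for p, q in zip(probs, q_values))
        advantage = q_table.get((state, action), 0.0) - baseline
        # Gradient of log softmax w.r.t. logit i is (1[i == action] - probs[i]).
        for i, a in enumerate(ACTIONS):
            indicator = 1.0 if a == action else 0.0
            logits[state][i] += lr * advantage * (indicator - probs[i])
    return logits


if __name__ == "__main__":
    trajectories = [
        ([("dialog", "click_ok")], True),
        ([("dialog", "click_cancel")], False),
    ]
    q_table = fit_q_table(trajectories)
    step_samples = [step for steps, _ in trajectories for step in steps]
    logits = defaultdict(lambda: [0.0] * len(ACTIONS))
    for _ in range(20):
        stepwise_update(logits, step_samples, q_table)
    print(dict(zip(ACTIONS, softmax(logits["dialog"]))))
```

Because `stepwise_update` reads only stored step samples and the Q-table, the sketch mirrors the decoupling described in point (ii): trajectories are collected once, and policy updates can then run without further environment rollouts.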
