2601.18197v1 Jan 26, 2026 cs.AI

GAIA: A Data Flywheel System for Training GUI Test-Time Scaling Critic Models

Shaokang Wang (citations: 9, h-index: 2)
Pei Fu (citations: 28, h-index: 4)
Ruoceng Zhang (citations: 9, h-index: 2)
Shaojie Zhang (citations: 9, h-index: 2)
Xiuwen Xi (citations: 5, h-index: 2)
Jiahui Yang (citations: 126, h-index: 3)
Bin Qin (citations: 21, h-index: 3)
Ying Huang (citations: 19, h-index: 3)
Zhenbo Luo (citations: 213, h-index: 7)
Jian Luan (citations: 12, h-index: 2)

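Large vision-language models (LVLMs) have substantially advanced the ability of GUI agents to parse textual instructions, interpret screen content, and execute tasks, but one key challenge remains: agent operations are irreversible, so a single wrong action can cause a catastrophic deviation. To address this, we propose the GUI Action Critic's Data Flywheel System (GAIA), a training framework that gives models iterative critic capabilities, which are then used to improve the test-time scaling (TTS) performance of base GUI agents. Specifically, we first train an Intuitive Critic Model (ICM) on positive and negative action examples from a base agent. This critic evaluates the immediate correctness of the action the agent intends to take, so that operations with a higher probability of success are selected. The initial critic then guides the agent's actions to collect more refined positive and negative samples, which starts a self-improving cycle. The augmented data is used to train a second-round critic with stronger discernment. Experiments on a range of datasets show that the proposed ICM improves the test-time performance of several closed-source and open-source models, and that performance improves gradually as the data is recycled. The code and dataset will be publicly released.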

Original Abstract

While Large Vision-Language Models (LVLMs) have significantly advanced GUI agents' capabilities in parsing textual instructions, interpreting screen content, and executing tasks, a critical challenge persists: the irreversibility of agent operations, where a single erroneous action can trigger catastrophic deviations. To address this, we propose the GUI Action Critic's Data Flywheel System (GAIA), a training framework that enables the models to have iterative critic capabilities, which are used to improve the Test-Time Scaling (TTS) of basic GUI agents' performance. Specifically, we train an Intuitive Critic Model (ICM) using positive and negative action examples from a base agent first. This critic evaluates the immediate correctness of the agent's intended actions, thereby selecting operations with higher success probability. Then, the initial critic guides agent actions to collect refined positive/negative samples, initiating the self-improving cycle. The augmented data then trains a second-round critic with enhanced discernment capability. We conduct experiments on various datasets and demonstrate that the proposed ICM can improve the test-time performance of various closed-source and open-source models, and the performance can be gradually improved as the data is recycled. The code and dataset will be publicly released.
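
The abstract describes two mechanisms: a critic that picks among candidate actions at test time, and a flywheel round in which critic-guided choices are turned into new labeled samples. The following is a minimal Python sketch of one possible reading of that loop, not the authors' implementation. Every name in it (`select_action`, `collect_round`, `Sample`, the toy agent and critic, and the `REFERENCE` table) is hypothetical, and labeling by comparison with a reference action is an assumption, since the abstract does not say how samples are labeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types: the abstract does not specify state or action formats.
State = str
Action = str
Proposer = Callable[[State, int], List[Action]]  # agent: state, n -> n candidates
Critic = Callable[[State, Action], float]        # critic: higher = more likely correct


@dataclass(frozen=True)
class Sample:
    state: State
    action: Action
    label: int  # 1 = positive example, 0 = negative example


def select_action(state: State, propose: Proposer, critic: Critic,
                  n_candidates: int = 4) -> Action:
    """Test-time scaling step: sample candidates, keep the critic's top pick."""
    candidates = propose(state, n_candidates)
    # max() returns the first candidate with the highest critic score.
    return max(candidates, key=lambda action: critic(state, action))


def collect_round(states: List[State], propose: Proposer, critic: Critic,
                  reference: Dict[State, Action]) -> List[Sample]:
    """One flywheel round: label each critic-guided choice against a reference.

    Assumption: a reference action per state is available for labeling.
    """
    samples = []
    for state in states:
        chosen = select_action(state, propose, critic)
        samples.append(Sample(state, chosen, int(chosen == reference[state])))
    return samples


# Toy stand-ins so the sketch runs without any model.
def toy_propose(state: State, n: int) -> List[Action]:
    pool = ["click(back)", "click(submit)", "scroll(down)", "click(submit)"]
    return [pool[i % len(pool)] for i in range(n)]


def toy_critic(state: State, action: Action) -> float:
    return 0.9 if "submit" in action else 0.1


REFERENCE = {"form": "click(submit)", "list": "scroll(down)"}

if __name__ == "__main__":
    for sample in collect_round(["form", "list"], toy_propose, toy_critic, REFERENCE):
        print(sample)
```

In this toy run the critic prefers `click(submit)` in both states, so the round yields one positive sample (state `form`) and one negative sample (state `list`). Under this reading, the negative is a case where the current critic was wrong, which is the kind of sample a second-round critic could be trained on.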

Paper metrics: 2 citations, 0 influential citations, Altmetric 3.5, score 19.5
