2603.08013v1 Mar 09, 2026 cs.AI

PIRA-Bench: A Transition from Reactive GUI Agents to GUI-based Proactive Intent Recommendation Agents

Yuxiang Chai

The Chinese University of Hong Kong

Shunye Tang

Han Xiao

Rui Liu

Hongsheng Li

Abstract

Current Graphical User Interface (GUI) agents operate primarily under a reactive paradigm: a user must provide an explicit instruction before the agent executes a task. An intelligent AI assistant, however, should be proactive: it should anticipate user intentions directly from continuous visual inputs, such as mobile or desktop screenshots, and offer timely recommendations without explicit user prompting. Transitioning to this proactive paradigm presents significant challenges. Real-world screen activity is rarely linear; it consists of long-horizon trajectories filled with noisy browsing, meaningless actions, and multithreaded task-switching. To address this gap, we introduce PIRA-Bench (Proactive Intent Recommendation Agent Benchmark), a novel benchmark for evaluating multimodal large language models (MLLMs) on continuous, weakly supervised visual inputs. Unlike reactive datasets, PIRA-Bench features complex trajectories that combine multiple interleaved intents and noisy segments with varied user profile contexts, challenging agents to detect actionable events while adapting to user preferences. Furthermore, we propose the PIRF baseline, a memory-aware, state-tracking framework that enables general MLLMs to manage multiple task threads and handle misleading visual inputs. PIRA-Bench serves as an initial step toward robust, proactive GUI-based personal assistants.
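
The abstract does not describe how PIRF works internally, so the following is only a minimal illustrative sketch of what memory-aware tracking of interleaved intent threads could look like. It is not the authors' implementation. Every name here (`IntentThread`, `IntentTracker`, `run`, `min_evidence`) is hypothetical, and the per-screenshot intent guess that an MLLM would produce is replaced by a plain string, with `None` standing in for a step judged to be noise.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IntentThread:
    """One candidate user intent and the steps at which it was observed."""
    intent: str
    evidence: list[int] = field(default_factory=list)  # step indices
    recommended: bool = False


class IntentTracker:
    """Hypothetical memory of interleaved intent threads (not the paper's PIRF)."""

    def __init__(self, min_evidence: int = 2) -> None:
        # Assumed threshold, chosen only for illustration.
        self.min_evidence = min_evidence
        self.threads: dict[str, IntentThread] = {}

    def observe(self, step: int, intent: Optional[str]) -> Optional[str]:
        """Record one screenshot-level guess; return an intent to recommend, if any."""
        if intent is None:
            return None  # noise leaves the memory untouched
        thread = self.threads.setdefault(intent, IntentThread(intent))
        thread.evidence.append(step)
        # Recommend each thread at most once, when enough evidence has accumulated.
        if not thread.recommended and len(thread.evidence) >= self.min_evidence:
            thread.recommended = True
            return intent
        return None


def run(stream: list[Optional[str]], min_evidence: int = 2) -> list[tuple[int, str]]:
    """Feed a sequence of per-step guesses and collect (step, intent) recommendations."""
    tracker = IntentTracker(min_evidence)
    recommendations = []
    for step, guess in enumerate(stream):
        rec = tracker.observe(step, guess)
        if rec is not None:
            recommendations.append((step, rec))
    return recommendations
```

For example, `run(["book_flight", None, "reply_email", "book_flight", None, "reply_email", "book_flight"])` keeps two threads alive across the noisy steps and recommends each one only once. A real system would need a far richer state than a counter, including the user profile context the benchmark provides.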
