2601.04500v1 Jan 08, 2026 cs.AI

GUITester: Enabling GUI Agents for Exploratory Defect Discovery

Yifei Gao (Citations: 105, h-index: 5)

Xiaoyi Chen (Citations: 11, h-index: 2)

Zhe Cui (Citations: 6, h-index: 2)

Jitao Sang (Citations: 134, h-index: 7)

Jiang Wu (Citations: 43, h-index: 4)

Yifan Yang (Citations: 36, h-index: 4)

Jiaming Zhang (Citations: 191, h-index: 5)

Tianyi Ma (Citations: 7, h-index: 2)

Abstract

Exploratory GUI testing is essential for software quality but is costly to perform manually. While Multi-modal Large Language Model (MLLM) agents excel at navigation, they fail to discover defects autonomously because of two core challenges: Goal-Oriented Masking, where agents prioritize task completion over reporting anomalies, and Execution-Bias Attribution, where system defects are misidentified as agent errors. To address these, we first introduce GUITestBench, the first interactive benchmark for this task, featuring 143 tasks across 26 defects. We then propose GUITester, a multi-agent framework that decouples navigation from verification via two modules: (i) a Planning-Execution Module (PEM) that proactively probes for defects via embedded testing intents, and (ii) a Hierarchical Reflection Module (HRM) that resolves attribution ambiguity through interaction history analysis. GUITester achieves an F1-score of 48.90% (Pass@3) on GUITestBench, outperforming state-of-the-art baselines (33.35%). Our work demonstrates the feasibility of autonomous exploratory testing and provides a robust foundation for future GUI quality assurance. The code is available at https://github.com/ADaM-BJTU/GUITestBench.
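For readers unfamiliar with the reported metric, the following is a minimal sketch of how an F1-score is computed from defect-detection counts. It is not taken from the GUITester code. The function names, the example counts, and the "any of k runs" reading of Pass@3 are all assumptions made for illustration; the abstract does not define how Pass@3 is aggregated.

```python
def f1_score(true_positives: int, false_positives: int, false_negatives: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    predicted = true_positives + false_positives  # defects the agent reported
    actual = true_positives + false_negatives     # defects that really exist
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def any_of_k(run_results: list[bool]) -> bool:
    """ASSUMED Pass@k aggregation: a task counts as passed if any run succeeded."""
    return any(run_results)


# Hypothetical counts, not results from the paper.
score = f1_score(true_positives=6, false_positives=4, false_negatives=6)
passed = any_of_k([False, True, False])
print(f"F1 = {score:.4f}, passed under assumed Pass@3 = {passed}")
```

With these made-up counts, precision is 6/10 and recall is 6/12, so a system that reports many false alarms or misses many defects is penalized on either side of the harmonic mean.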

4 Citations

1 Influential

35.92453324894 Altmetric

185.6 Score
