2604.13531v1 Apr 15, 2026 cs.AI

RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

Ze Xu (Citations: 31, h-index: 2)
Tianyi Zhang (Citations: 35, h-index: 2)
Renqi Chen (Citations: 131, h-index: 5)
Zeyi Tao (Citations: 3, h-index: 1)
Qingqing Sun (Citations: 23, h-index: 2)
Shuai Chen (Citations: 26, h-index: 3)
Jianming Guo (Citations: 7, h-index: 1)
Jing Wang (Citations: 1,530, h-index: 7)
Jingzhe Zhu (Citations: 3, h-index: 1)

Original Abstract

Graphical User Interface (GUI) agents show strong capabilities for automating web tasks, but existing interactive benchmarks primarily target benign, predictable consumer environments. Their effectiveness in high-stakes, investigative domains such as authentic e-commerce risk management remains underexplored. To bridge this gap, we present RiskWebWorld, the first highly realistic interactive benchmark for evaluating GUI agents in e-commerce risk management. RiskWebWorld features 1,513 tasks sourced from production risk-control pipelines across 8 core domains, and captures the authentic challenges of risk operations, such as uncooperative websites and partial environmental hijacking. To support scalable evaluation and agentic reinforcement learning (RL), we further build a Gymnasium-compliant infrastructure that decouples policy planning from environment mechanics. Our evaluation across diverse models reveals a dramatic capability gap: top-tier generalist models achieve a 49.1% success rate, while specialized open-weights GUI models fail almost entirely. This suggests that foundation model scale currently matters more than zero-shot interface grounding in long-horizon professional tasks. We also demonstrate the viability of our infrastructure through agentic RL, which improves open-source models by 16.2%. These results position RiskWebWorld as a practical testbed for developing robust digital workers.
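
The abstract does not show what a Gymnasium-compliant interface that separates policy planning from environment mechanics looks like in practice. The sketch below is one possible illustration and is not the authors' code. `ToyRiskEnv`, `scripted_policy`, `rollout`, and the action strings are all hypothetical names. The only part taken from Gymnasium is its call convention: `reset()` returns `(observation, info)` and `step(action)` returns `(observation, reward, terminated, truncated, info)`. The block uses only the standard library, so it imitates that convention rather than subclassing `gymnasium.Env`.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple

Observation = Dict[str, Any]
Action = str


@dataclass
class ToyRiskEnv:
    """Hypothetical stand-in for a browser environment (not the RiskWebWorld API).

    It owns all environment mechanics: state, step counting, and termination.
    """
    target: str = "flag_merchant"   # assumed action that completes the task
    max_steps: int = 5              # assumed step budget before truncation
    _steps: int = field(default=0, init=False)

    def reset(self, seed: Optional[int] = None, options: Optional[dict] = None):
        # Gymnasium convention: return (observation, info).
        self._steps = 0
        return {"page": "merchant_profile", "step": 0}, {}

    def step(self, action: Action):
        # Gymnasium convention:
        # return (observation, reward, terminated, truncated, info).
        self._steps += 1
        terminated = action == self.target
        truncated = (not terminated) and self._steps >= self.max_steps
        reward = 1.0 if terminated else 0.0
        obs = {"page": "merchant_profile", "step": self._steps}
        return obs, reward, terminated, truncated, {}


def rollout(env, policy: Callable[[Observation], Action]) -> Tuple[float, int]:
    """Run one episode; the policy sees only observations, never env internals."""
    obs, _info = env.reset()
    total, steps, done = 0.0, 0, False
    while not done:
        obs, reward, terminated, truncated, _info = env.step(policy(obs))
        total += reward
        steps += 1
        done = terminated or truncated
    return total, steps


def scripted_policy(obs: Observation) -> Action:
    """Toy planner: scroll twice, then take the flagging action."""
    return "scroll" if obs["step"] < 2 else "flag_merchant"
```

Because `rollout` depends only on the `reset`/`step` contract, the same loop could in principle drive a scripted policy, an LLM-backed planner, or an RL learner without touching the environment code, which is the kind of decoupling the abstract describes.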

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
