2602.02419v1 Feb 02, 2026 cs.AI

SafeGround: Know When to Trust GUI Grounding Models via Uncertainty Calibration

X. Wang

Citations: 368

h-index: 9

Yue Fan

Citations: 326

h-index: 9

Qingni Wang

Citations: 163

h-index: 5

Original Abstract

Graphical User Interface (GUI) grounding aims to translate natural language instructions into executable screen coordinates, enabling automated GUI interaction. Nevertheless, incorrect grounding can result in costly, hard-to-reverse actions (e.g., erroneous payment approvals), raising concerns about model reliability. In this paper, we introduce SafeGround, an uncertainty-aware framework for GUI grounding models that enables risk-aware predictions through calibration before testing. SafeGround leverages a distribution-aware uncertainty quantification method to capture the spatial dispersion of stochastic samples drawn from the outputs of any given model. Then, through the calibration process, SafeGround derives a test-time decision threshold with statistically guaranteed false discovery rate (FDR) control. We apply SafeGround to multiple GUI grounding models on the challenging ScreenSpot-Pro benchmark. Experimental results show that our uncertainty measure consistently outperforms existing baselines in distinguishing correct from incorrect predictions, while the calibrated threshold reliably enables rigorous risk control and shows potential for substantial system-level accuracy improvements. Across multiple GUI grounding models, SafeGround improves system-level accuracy by up to 5.38 percentage points over Gemini-only inference.
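
The two stages the abstract describes (scoring uncertainty from the spatial spread of sampled predictions, then calibrating an acceptance threshold against a target error rate) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `spatial_dispersion` and `calibrate_threshold`, the RMS-distance score, and the Hoeffding-style bound are all assumptions chosen for illustration. The bound also does not correct for scanning many candidate thresholds, so it does not by itself give the statistical guarantee the paper claims.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float]


def spatial_dispersion(samples: Sequence[Point]) -> float:
    """Hypothetical uncertainty score: RMS distance of sampled click
    points from their centroid. Larger spread means higher uncertainty."""
    n = len(samples)
    cx = sum(x for x, _ in samples) / n
    cy = sum(y for _, y in samples) / n
    return math.sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in samples) / n)


def calibrate_threshold(
    scores: List[float],
    correct: List[bool],
    alpha: float,
    delta: float = 0.1,
) -> Optional[float]:
    """Pick the largest threshold t such that, among calibration examples
    accepted by the rule `score <= t`, the error fraction plus a Hoeffding
    slack term stays at or below alpha. Returns None if no threshold works.

    Illustrative only: the slack term is an assumption of this sketch and
    is not adjusted for testing multiple thresholds.
    """
    pairs = sorted(zip(scores, correct))
    best = None
    errors = 0
    for k, (score, ok) in enumerate(pairs, start=1):
        errors += 0 if ok else 1
        # Evaluate a threshold only at the last example of a run of ties.
        if k < len(pairs) and pairs[k][0] == score:
            continue
        bound = errors / k + math.sqrt(math.log(1.0 / delta) / (2 * k))
        if bound <= alpha:
            best = score
    return best


def accept(samples: Sequence[Point], threshold: Optional[float]) -> bool:
    """Test-time rule: act on the prediction only if its dispersion is
    within the calibrated threshold; otherwise defer (e.g. to a fallback)."""
    return threshold is not None and spatial_dispersion(samples) <= threshold
```

At test time, a prediction whose samples cluster tightly is accepted, while a widely scattered one is deferred to a fallback such as a stronger model or a human check.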

1 Citations

0 Influential

4.5 Altmetric

23.5 Score
