2605.14311v1 May 14, 2026 cs.LG

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Pei Fu (Citations: 28, h-index: 4)
Ruoceng Zhang (Citations: 9, h-index: 2)
Shaojie Zhang (Citations: 9, h-index: 2)
Xiuwen Xi (Citations: 5, h-index: 2)
Zhenbo Luo (Citations: 213, h-index: 7)
Jian Luan (Citations: 220, h-index: 7)
Chongyang Zhang (Citations: 125, h-index: 4)
Yuchen Sun (Citations: 23, h-index: 2)
Anan Du (Citations: 17, h-index: 3)

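Test-time scaling (TTS), which samples diverse candidate actions and ranks them with a critic model, has emerged as a promising paradigm for general-purpose GUI agents. The effectiveness of TTS depends on the critic model's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our analysis of these models reveals a serious problem: the scores of valid actions and of plausible but invalid actions (distractors) become indistinguishable. This failure stems from two structural defects. The first, Affordance Collapse, is the compression of a hierarchical affordance space into 0/1 labels. The second, Noise Sensitivity, is the tendency of binary objectives to overfit to noisy decision boundaries. To address these problems, we propose BBCritic (Beyond-Binary Critic), a new paradigm grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared affordance space and recovers the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark to combine a dense action space with a hierarchical four-level taxonomy, which makes fine-grained evaluation of ranking ability possible. In experiments, BBCritic-3B, trained without any additional annotation, outperforms state-of-the-art binary classification models with 7 billion parameters. BBCritic also shows strong zero-shot transfer across platforms and tasks, which supports our view that GUI critique is fundamentally a metric-learning problem rather than a classification problem.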

Original Abstract

Test-Time Scaling (TTS), which samples multiple candidate actions and ranks them via a Critic Model, has emerged as a promising paradigm for generalist GUI agents. Its efficacy thus hinges on the critic's fine-grained ranking ability. However, existing GUI critic models uniformly adopt binary classification. Our motivational analysis of these models exposes a severe entanglement: scores for valid actions and plausible-but-invalid distractors become indistinguishable. We attribute this failure to two structural defects: Affordance Collapse, in which the hierarchical affordance space is compressed into 0/1 labels; and Noise Sensitivity, in which binary objectives overfit to noisy decision boundaries. To resolve this, we introduce BBCritic (Beyond-Binary Critic), a paradigm shift grounded in the Functional Equivalence Hypothesis. Through two-stage contrastive learning, BBCritic aligns instructions and actions in a shared Affordance Space, recovering the hierarchical structure that binary supervision flattens. We also present BBBench (Beyond-Binary Bench), the first GUI critic benchmark that pairs a dense action space with a hierarchical four-level taxonomy, enabling fine-grained ranking evaluation. Experimental results show that BBCritic-3B, trained without any extra annotation, outperforms 7B-parameter SOTA binary models. It demonstrates strong zero-shot transferability across platforms and tasks, supporting our methodological view: GUI critique is fundamentally a metric-learning problem, not a classification one.
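
The abstract does not specify BBCritic's encoder, similarity function, or training loss, so the following is only a minimal sketch of the general idea it describes: rank sampled candidate actions by a continuous similarity to the instruction rather than by a 0/1 label, and train that similarity with a contrastive objective. Cosine similarity, the InfoNCE loss, the temperature value, the function names, and the toy 2-D vectors are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_candidates(instruction_vec, candidate_vecs):
    """Test-time scaling step: score every sampled candidate, best first."""
    scores = [cosine(instruction_vec, c) for c in candidate_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


def info_nce(instruction_vec, positive_vec, negative_vecs, temperature=0.1):
    """InfoNCE loss (assumed objective): -log softmax of the positive's score."""
    logits = [cosine(instruction_vec, positive_vec) / temperature]
    logits += [cosine(instruction_vec, n) / temperature for n in negative_vecs]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[0]


# Hypothetical embeddings in a shared 2-D space (illustrative values only).
instruction = [1.0, 0.0]
candidates = [
    [0.0, 1.0],  # unrelated action
    [0.9, 0.1],  # valid action
    [0.6, 0.8],  # plausible-but-invalid distractor
]

order, scores = rank_candidates(instruction, candidates)
loss_good = info_nce(instruction, candidates[1], [candidates[0], candidates[2]])
loss_bad = info_nce(instruction, candidates[2], [candidates[0], candidates[1]])
print(order, [round(s, 3) for s in scores])
print(round(loss_good, 3), round(loss_bad, 3))
```

In this toy setup the distractor receives an intermediate score instead of being forced to 0 or 1, which is the kind of graded ordering a binary critic cannot express. The loss is also lower when the valid action is treated as the positive than when the distractor is.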

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
