2604.05157v1 Apr 06, 2026 cs.AI

IntentScore: Improving Computer-Use Agent Performance Through Intent-Based Action Evaluation

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

Sizhe Tang

Citations: 11

h-index: 2

Tian Lan

Citations: 10

h-index: 2

Zeyu Fang

Citations: 34

h-index: 3

Rongqian Chen

Citations: 15

h-index: 2

Yu Li

Citations: 8

h-index: 2

Weidong Cao

Citations: 10

h-index: 2

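Computer-use agents (CUAs) use large language models to carry out GUI tasks in desktop environments, but they act without evaluating the quality of their actions, so they can make irreversible errors that cascade into later steps. This work proposes IntentScore, a plan-aware reward model trained to score candidate actions on 398K offline GUI interaction steps collected across three operating systems. IntentScore is trained with two complementary objectives: contrastive alignment, which captures state-action relevance, and margin ranking, which captures action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, which lets it distinguish candidates whose actions are similar but whose rationales differ. IntentScore reaches 97.5% pairwise discrimination accuracy on held-out evaluation data. Used as a re-ranker for Agent S3 on OSWorld, an environment never seen during training, it raises task success rate by 6.9 points, which indicates that reward estimation learned from heterogeneous offline data can generalize to new agents and task distributions.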
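The re-ranking step described above can be illustrated with a minimal sketch: the agent proposes several candidates, a scorer rates each one given the current state, and the top-rated candidate is executed. Everything named here is an assumption made for illustration: the `Candidate` fields, the `rerank` helper, and especially `toy_score`, which is a word-overlap stand-in and not the learned IntentScore model or the Agent S3 interface.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Candidate:
    """One proposed step (hypothetical structure, not from the paper)."""
    intent: str  # the planning rationale attached to this candidate
    action: str  # a serialized GUI action, e.g. a click at screen coordinates


def rerank(
    state: str,
    candidates: Sequence[Candidate],
    score_fn: Callable[[str, Candidate], float],
) -> Candidate:
    """Return the highest-scoring candidate; on ties, keep the earliest proposal."""
    if not candidates:
        raise ValueError("need at least one candidate")
    # Python's max() returns the first maximal element, so ties favor earlier items.
    return max(candidates, key=lambda c: score_fn(state, c))


def toy_score(state: str, cand: Candidate) -> float:
    """Stand-in scorer: number of intent words that also appear in the state text.

    A real reward model would encode the screen state, the action, and the
    intent with learned networks; this function only makes the sketch runnable.
    """
    state_words = set(state.lower().split())
    return float(sum(w in state_words for w in cand.intent.lower().split()))


# Usage: two candidates with different rationales for the same screen.
state = "settings window open with dark mode toggle visible"
candidates = [
    Candidate(intent="close the window", action="click(980, 12)"),
    Candidate(intent="enable dark mode toggle", action="click(640, 300)"),
]
best = rerank(state, candidates, toy_score)
print(best.action)
```

Passing the scorer in as a function keeps the selection logic independent of how scores are produced, so the toy scorer could be swapped for a trained model without changing `rerank`.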

Original Abstract

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.
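The abstract names two training objectives without giving their formulas. As a rough sketch only, the block below uses one common form of each: an InfoNCE-style contrastive loss over a batch of paired state and action embeddings, and a hinge-style margin ranking loss over a correct and an incorrect candidate score. The function names, the temperature of 0.1, and the margin of 1.0 are assumptions chosen for illustration and may differ from what the paper actually uses.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Plain dot product, used here as the similarity between embeddings."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(
    states: Sequence[Vector],
    actions: Sequence[Vector],
    temperature: float = 0.1,  # assumed value, not from the paper
) -> float:
    """InfoNCE-style loss: states[i] should match actions[i] better than any
    other action in the batch. Returns the mean loss over the batch."""
    total = 0.0
    for i, s in enumerate(states):
        logits = [dot(s, a) / temperature for a in actions]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denom - logits[i]
    return total / len(states)


def margin_ranking_loss(
    pos_score: float,
    neg_score: float,
    margin: float = 1.0,  # assumed value, not from the paper
) -> float:
    """Hinge loss: zero once the correct action outscores the incorrect one
    by at least `margin`, and growing linearly otherwise."""
    return max(0.0, margin - (pos_score - neg_score))


# Usage: aligned pairs give a small contrastive loss, swapped pairs a large one.
states = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(states, [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss(states, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
print(margin_ranking_loss(2.0, 0.5), margin_ranking_loss(0.5, 2.0))
```

The two terms do different jobs in this sketch: the contrastive term only asks whether an action belongs with a state at all, while the ranking term asks that a correct action score above an incorrect one for the same state.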

0 Citations

0 Influential

1.5 Altmetric

7.5 Score

