2603.19191v1 Mar 19, 2026 cs.AI

OS-Themis: A Scalable Critic Framework for Generalist GUI Rewards

Zhuoran Li

Zichen Ding

Bowen Zhou

Kaiming Jin

Zhenyu Wu

Zhaoyang Liu

Zhoumianze Liu

Zun Wang

Jianze Liang

Yibo Zhao

Abstract

Reinforcement Learning (RL) has the potential to improve the robustness of GUI agents in stochastic environments, yet training is highly sensitive to the quality of the reward function. Existing reward approaches struggle to achieve both scalability and performance. To address this, we propose OS-Themis, a scalable and accurate multi-agent critic framework. Unlike a single judge, OS-Themis decomposes trajectories into verifiable milestones to isolate critical evidence for decision making and employs a review mechanism to strictly audit the evidence chain before making the final verdict. To facilitate evaluation, we further introduce OmniGUIRewardBench (OGRBench), a holistic cross-platform benchmark for GUI outcome rewards, where all evaluated models achieve their best performance under OS-Themis. Extensive experiments on AndroidWorld show that OS-Themis yields a 10.3% improvement when used to support online RL training, and a 6.9% gain when used for trajectory validation and filtering in the self-training loop, highlighting its potential to drive agent evolution.
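The milestone-and-review idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: in the paper the critic is a multi-agent framework built on models, whereas here simple Python predicates stand in for the verifiers. All names (`Step`, `Milestone`, `collect_evidence`, `review`, `outcome_reward`, `filter_trajectories`) are hypothetical, and the audit rule (every milestone must have evidence, and evidence must appear in milestone order) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Step:
    action: str
    observation: str  # assumed: a text description of the screen after the action


@dataclass
class Milestone:
    description: str
    check: Callable[[Step], bool]  # hypothetical verifier for this milestone


@dataclass
class Evidence:
    milestone: str
    step_index: int  # index of the first supporting step, or -1 if none was found


def collect_evidence(trajectory: Sequence[Step],
                     milestones: Sequence[Milestone]) -> List[Evidence]:
    """Find, for each milestone, the first step that satisfies its verifier."""
    evidence = []
    for m in milestones:
        idx = next((i for i, s in enumerate(trajectory) if m.check(s)), -1)
        evidence.append(Evidence(m.description, idx))
    return evidence


def review(evidence: Sequence[Evidence]) -> bool:
    """Audit the evidence chain (assumed rule: complete and in milestone order)."""
    indices = [e.step_index for e in evidence]
    if any(i < 0 for i in indices):
        return False
    return all(a <= b for a, b in zip(indices, indices[1:]))


def outcome_reward(trajectory: Sequence[Step],
                   milestones: Sequence[Milestone]) -> float:
    """Binary outcome reward: 1.0 only if the audited evidence chain holds."""
    return 1.0 if review(collect_evidence(trajectory, milestones)) else 0.0


def filter_trajectories(trajectories: Sequence[Sequence[Step]],
                        milestones: Sequence[Milestone]) -> list:
    """Keep only trajectories the critic accepts, as in a self-training filter."""
    return [t for t in trajectories if outcome_reward(t, milestones) == 1.0]
```

The same `outcome_reward` function serves both uses mentioned in the abstract: as a reward signal during online RL and as an accept/reject filter over collected trajectories before self-training.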
