2604.22558v1 Apr 24, 2026 cs.LG

SOLAR-RL: GUI Task Assignment with Semi-Online Long-Horizon Reinforcement Learning

SOLAR-RL: Semi-Online Long-horizon Assignment Reinforcement Learning

Han Xiao (Citations: 159, h-index: 5)

Guozhi Wang (Citations: 131, h-index: 7)

Yufeng Zhou (Citations: 21, h-index: 2)

Jichao Wang (Citations: 26, h-index: 2)

Yafei Wen (Citations: 397, h-index: 8)

Xiaoxin Chen (Citations: 341, h-index: 7)

Liuyang Bian (Citations: 46, h-index: 2)

Yue Pan (Citations: 61, h-index: 4)

Hao Wang (Citations: 172, h-index: 5)

Shuai Ren (Citations: 88, h-index: 4)

Lingfang Zeng (Citations: 8, h-index: 2)

Zhaoxiong Wang (Citations: 52, h-index: 3)

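As multimodal large language models (MLLMs) advance, GUI agents are evolving from static interaction to complex navigation. Reinforcement learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, but applying it effectively is difficult. Conventional offline RL often relies on static step-level data and ignores whole-trajectory semantics such as task completion and execution quality. Online RL, by contrast, captures long-horizon dynamics but suffers from high interaction costs and potential environment instability. To bridge this gap, we propose the Semi-Online Long-horizon Assignment Reinforcement Learning (SOLAR-RL) framework. Rather than relying solely on costly online interaction, our framework integrates whole-trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards, with target-aligned shaping, that reflect trajectory-level execution quality. This effectively simulates online feedback without incurring interaction costs. Extensive experiments show that SOLAR-RL significantly improves long-horizon task completion rates and robustness over strong baselines, offering a sample-efficient solution for autonomous GUI navigation.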

Original Abstract

As Multimodal Large Language Models (MLLMs) mature, GUI agents are evolving from static interactions to complex navigation. While Reinforcement Learning (RL) has emerged as a promising paradigm for training MLLM agents on dynamic GUI tasks, its effective application faces a dilemma. Standard Offline RL often relies on static step-level data, neglecting global trajectory semantics such as task completion and execution quality. Conversely, Online RL captures the long-term dynamics but suffers from high interaction costs and potential environmental instability. To bridge this gap, we propose SOLAR-RL (Semi-Online Long-horizon Assignment Reinforcement Learning). Instead of relying solely on expensive online interactions, our framework integrates global trajectory insights directly into the offline learning process. Specifically, we reconstruct diverse rollout candidates from static data, detect the first failure point using per-step validity signals, and retroactively assign dense step-level rewards with target-aligned shaping to reflect trajectory-level execution quality, effectively simulating online feedback without interaction costs. Extensive experiments demonstrate that SOLAR-RL significantly improves long-horizon task completion rates and robustness compared to strong baselines, offering a sample-efficient solution for autonomous GUI navigation.
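The abstract describes detecting the first failure point from per-step validity signals and then assigning dense step-level rewards retroactively, but it gives no formula. The sketch below is one possible reading, not the authors' method: the function names, the progress-scaled bonus, and the fixed failure penalty are all assumptions made for illustration, and the paper's target-aligned shaping is not modeled.

```python
from typing import List, Optional


def first_failure_index(valid: List[bool]) -> Optional[int]:
    """Return the index of the first invalid step, or None if every step is valid."""
    for i, ok in enumerate(valid):
        if not ok:
            return i
    return None


def assign_step_rewards(
    valid: List[bool],
    step_bonus: float = 1.0,        # hypothetical reward scale for valid prefix steps
    failure_penalty: float = -1.0,  # hypothetical penalty at the first failing step
) -> List[float]:
    """Assign a reward to every step of one rollout (illustrative rule only).

    Steps before the first failure share a bonus scaled by how much of the
    trajectory was completed, the failing step gets the penalty, and steps
    after the failure get zero because they no longer contribute to the task.
    """
    n = len(valid)
    if n == 0:
        return []
    k = first_failure_index(valid)
    # Fraction of the trajectory executed correctly before the first failure.
    progress = (n if k is None else k) / n
    rewards: List[float] = []
    for i in range(n):
        if k is None or i < k:
            rewards.append(step_bonus * progress)
        elif i == k:
            rewards.append(failure_penalty)
        else:
            rewards.append(0.0)
    return rewards
```

Under this rule, a four-step rollout that first fails at its third step gives its two valid steps a bonus of 0.5 each, so each step's reward depends on how far the whole trajectory got rather than on that step's correctness alone.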

0 Citations

0 Influential

4 Altmetric

20.0 Score
