2602.20502v1 Feb 24, 2026 cs.AI

ActionEngine: From Reactive to Programmatic GUI Agents via State Machine Memory

Fazle Faisal

Tanakorn Leesatapornwongsa

Adriana Szekeres

Kexin Rong

Suman Nath

Hongbin Zhong

Luis França

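Existing graphical user interface (GUI) agents typically work by repeating a loop: capture a screenshot, reason about the next action, execute it, and repeat on the new page. With this approach, cost and latency grow with the number of reasoning steps, and accuracy is limited because the agent keeps no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that moves from reactive execution to programmatic planning through a novel two-agent architecture. The first agent, the Crawling Agent, builds an updatable state-machine memory of the GUI through offline exploration; the second, the Execution Agent, uses this memory to generate a complete, executable Python program for online task execution. To stay robust as interfaces evolve, an execution failure triggers a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design substantially improves both efficiency and accuracy. On Reddit tasks from the WebArena benchmark, our agent achieves a 95% task success rate with a single LLM call on average, compared with 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components combine global programmatic planning, crawler-validated action templates, and node-level execution with local validation and repair, enabling scalable and reliable GUI interaction.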

Original Abstract

Existing Graphical User Interface (GUI) agents operate through step-by-step calls to vision language models--taking a screenshot, reasoning about the next action, executing it, then repeating on the new page--resulting in high costs and latency that scale with the number of reasoning steps, and limited accuracy due to no persistent memory of previously visited pages. We propose ActionEngine, a training-free framework that transitions from reactive execution to programmatic planning through a novel two-agent architecture: a Crawling Agent that constructs an updatable state-machine memory of the GUIs through offline exploration, and an Execution Agent that leverages this memory to synthesize complete, executable Python programs for online task execution. To ensure robustness against evolving interfaces, execution failures trigger a vision-based re-grounding fallback that repairs the failed action and updates the memory. This design drastically improves both efficiency and accuracy: on Reddit tasks from the WebArena benchmark, our agent achieves 95% task success with on average a single LLM call, compared to 66% for the strongest vision-only baseline, while reducing cost by 11.8x and end-to-end latency by 2x. Together, these components yield scalable and reliable GUI interaction by combining global programmatic planning, crawler-validated action templates, and node-level execution with localized validation and repair.
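The abstract does not specify the framework's data structures or APIs, so the following is only a minimal sketch of the idea under stated assumptions: pages are modeled as states, crawler-recorded actions as labeled edges, and a breadth-first search stands in for the LLM-based program synthesis. `StateMachineMemory`, `run_plan`, `click`, `reground`, and all selectors are hypothetical names chosen for illustration, not the authors' implementation.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class StateMachineMemory:
    """Hypothetical GUI memory: pages are states, actions are labeled edges."""
    # transitions[state][action] = (selector, next_state)
    transitions: dict = field(default_factory=dict)

    def add(self, state, action, selector, next_state):
        """Record (or overwrite) one action edge, as a crawler might."""
        self.transitions.setdefault(state, {})[action] = (selector, next_state)

    def plan(self, start, goal):
        """Breadth-first search for the shortest action sequence to the goal.

        Assumption: this search replaces the LLM that writes the program.
        """
        queue = deque([(start, [])])
        seen = {start}
        while queue:
            state, path = queue.popleft()
            if state == goal:
                return path
            for action, (_, nxt) in self.transitions.get(state, {}).items():
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [(state, action)]))
        return None


def run_plan(memory, plan, click, reground):
    """Execute each step; on failure, re-ground once and update the memory."""
    repairs = 0
    for state, action in plan:
        selector, next_state = memory.transitions[state][action]
        if not click(selector):
            # Stand-in for the vision-based re-grounding fallback.
            new_selector = reground(state, action)
            if not click(new_selector):
                raise RuntimeError(f"could not repair {action!r} on {state!r}")
            memory.add(state, action, new_selector, next_state)
            repairs += 1
    return repairs


# Offline phase: a toy crawl of a forum site (made-up states and selectors).
memory = StateMachineMemory()
memory.add("home", "open_forum", "a#forums", "forum_list")
memory.add("forum_list", "open_thread", "a.thread", "thread")
memory.add("thread", "open_reply", "button.reply", "reply_form")

# Online phase: plan the whole task up front instead of step by step.
plan = memory.plan("home", "reply_form")

# Simulated interface drift: the reply button's selector changed after crawling.
live_selectors = {"a#forums", "a.thread", "button#reply-v2"}


def click(selector):
    """Pretend browser click: succeeds only if the selector still exists."""
    return selector in live_selectors


def reground(state, action):
    """Pretend vision model: returns the element's current selector."""
    return "button#reply-v2"


repairs = run_plan(memory, plan, click, reground)
print(plan, repairs)
```

In this toy run only the drifted step pays for a fallback call, and the repaired selector is written back to the memory so later plans reuse it. That mirrors the localized validation and repair the abstract describes, though the real system's repair logic is not detailed here.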
