2602.01664v3 Feb 02, 2026 cs.AI

FlowSteer: Interactive Agentic Workflow Orchestration via End-to-End Reinforcement Learning

Tiesunlong Shen

Citations: 69

h-index: 6

Erik Cambria

Citations: 35

h-index: 3

Haoran Luo

Citations: 31

h-index: 3

Rui Mao

Citations: 95

h-index: 3

Mingda Zhang

Citations: 2,758

h-index: 3

Qika Lin

Citations: 270

h-index: 8

Xiaoying Tang

Citations: 47

h-index: 4

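In recent years, a variety of powerful agentic workflows have been used to solve a wide range of human problems. However, existing workflow orchestration still faces major challenges: high manual cost, dependence on specific operators and large language models (LLMs), and sparse reward signals. To address these problems, we propose FlowSteer, an end-to-end reinforcement learning framework. FlowSteer uses a lightweight policy model as the agent and an executable canvas environment to automate workflow orchestration through multi-turn interaction. In this process, the policy model analyzes the execution state and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. FlowSteer also provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To train this interaction paradigm effectively, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experiments on twelve datasets show that FlowSteer significantly outperforms existing methods across a variety of tasks.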

Original Abstract

In recent years, a variety of powerful agentic workflows have been applied to solve a wide range of human problems. However, existing workflow orchestration still faces key challenges, including high manual cost, reliance on specific operators/large language models (LLMs), and sparse reward signals. To address these challenges, we propose FlowSteer, an end-to-end reinforcement learning framework that takes a lightweight policy model as the agent and an executable canvas environment, automating workflow orchestration through multi-turn interaction. In this process, the policy model analyzes execution states and selects editing actions, while the canvas executes operators and returns feedback for iterative refinement. Moreover, FlowSteer provides a plug-and-play framework that supports diverse operator libraries and interchangeable LLM backends. To effectively train this interaction paradigm, we propose Canvas Workflow Relative Policy Optimization (CWRPO), which introduces diversity-constrained rewards with conditional release to stabilize learning and suppress shortcut behaviors. Experimental results on twelve datasets show that FlowSteer significantly outperforms baselines across various tasks.
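The multi-turn interaction described in the abstract (the policy reads execution feedback, picks an editing action, and the canvas re-executes the workflow) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the `Canvas` class, the `("add", name)` / `("remove", name)` / `("stop", "")` action format, the string operators, and the rule-based `toy_policy` are hypothetical stand-ins, not FlowSteer's actual interface, and no reinforcement learning or CWRPO reward is implemented here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical operator type: maps the running text state to a new state.
Operator = Callable[[str], str]
# Hypothetical editing action: (kind, operator_name), kind in {"add", "remove", "stop"}.
Action = Tuple[str, str]


@dataclass
class Canvas:
    """Toy executable canvas holding an ordered list of operator names."""
    operators: Dict[str, Operator]
    steps: List[str] = field(default_factory=list)

    def apply(self, action: Action) -> None:
        """Apply one editing action to the workflow; unknown edits are ignored."""
        kind, name = action
        if kind == "add" and name in self.operators:
            self.steps.append(name)
        elif kind == "remove" and name in self.steps:
            self.steps.remove(name)

    def execute(self, task_input: str) -> str:
        """Run the current workflow and return its output as feedback."""
        state = task_input
        for name in self.steps:
            state = self.operators[name](state)
        return state


def toy_policy(feedback: str, steps: List[str]) -> Action:
    """Rule-based stand-in for the lightweight policy model."""
    if feedback != feedback.strip():
        return ("add", "strip")
    if feedback != feedback.lower():
        return ("add", "lower")
    return ("stop", "")


def orchestrate(
    canvas: Canvas,
    policy: Callable[[str, List[str]], Action],
    task_input: str,
    max_turns: int = 5,
) -> Tuple[List[str], str]:
    """Alternate between policy edits and canvas execution for up to max_turns."""
    feedback = canvas.execute(task_input)
    for _ in range(max_turns):
        action = policy(feedback, canvas.steps)
        if action[0] == "stop":
            break
        canvas.apply(action)
        feedback = canvas.execute(task_input)
    return canvas.steps, feedback
```

As a usage example, calling `orchestrate(Canvas({"strip": str.strip, "lower": str.lower}), toy_policy, "  Hello World ")` builds the workflow one edit per turn and stops once the feedback needs no further edits. In the framework the abstract describes, a trained policy model would take the place of `toy_policy`, and the operators would presumably call an LLM backend rather than string methods.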
