2602.01202v1 Feb 01, 2026 cs.AI

Workflow-R1: Group Sub-sequence Policy Optimization for Multi-turn Workflow Construction

Mingze Kong

Citations: 7

h-index: 1

Zikun Qu

Citations: 8

h-index: 1

Zhongquan Zhou

Citations: 1

h-index: 1

Pengyu Liang

Citations: 1

h-index: 1

Xiang Li

Citations: 6

h-index: 1

Zhiwei Shang

Citations: 3

h-index: 1

Zhi Hong

Citations: 2,716

h-index: 2

Kaiyu Huang

Citations: 44

h-index: 3

Zhiyong Wang

Citations: 17

h-index: 3

Zhongxiang Dai

Citations: 44

h-index: 4

Abstract

The rapid evolution of agentic workflows has demonstrated strong performance of LLM-based agents in addressing complex reasoning tasks. However, existing workflow optimization methods typically formulate workflow synthesis as a static, one-shot code-centric generation problem. This paradigm imposes excessive constraints on the model's coding capabilities and restricts the flexibility required for dynamic problem-solving. In this paper, we present Workflow-R1, a framework that reformulates workflow construction as a multi-turn, natural language-based sequential decision-making process. To resolve the optimization granularity mismatch inherent in such multi-turn interactions, we introduce Group Sub-sequence Policy Optimization (GSsPO). While explicitly tailored to align with the interleaved Think-Action dynamics of agentic reasoning, GSsPO fundamentally functions as a structure-aware RL algorithm generalizable to a broad class of multi-turn agentic sequential decision-making tasks. By recalibrating the optimization unit to the composite sub-sequence, specifically the atomic Think-Action cycle, it aligns gradient updates with the semantic boundaries of these interactions, ensuring robust learning in complex multi-turn reasoning tasks. Through extensive experiments on multiple QA benchmarks, Workflow-R1 outperforms competitive baselines, validating GSsPO as a generalized solution for sequential reasoning and establishing Workflow-R1 as a promising new paradigm for automated workflow optimization.
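The abstract does not give the GSsPO objective, so the following is only a minimal sketch of what optimizing at the level of Think-Action sub-sequences could look like. It assumes a GRPO-style group-normalized advantage and a length-normalized, clipped importance ratio computed per sub-sequence; both are assumptions borrowed from related methods, not the paper's confirmed formulation. The `</action>` delimiter, the function names, and the clipping value are hypothetical.

```python
import math
from typing import List, Tuple

Span = Tuple[int, int]  # half-open token range [start, end)


def split_think_action(tokens: List[str], close_tag: str = "</action>") -> List[Span]:
    """Cut a trajectory into sub-sequences, each ending right after close_tag.

    The tag name is a hypothetical delimiter for the end of one Think-Action cycle.
    """
    spans: List[Span] = []
    start = 0
    for i, tok in enumerate(tokens):
        if tok == close_tag:
            spans.append((start, i + 1))
            start = i + 1
    if start < len(tokens):  # trailing tokens after the last action
        spans.append((start, len(tokens)))
    return spans


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Assumed GRPO-style advantage: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


def subsequence_ratios(
    logp_new: List[float], logp_old: List[float], spans: List[Span]
) -> List[float]:
    """One importance ratio per sub-sequence (assumed length-normalized form)."""
    ratios = []
    for start, end in spans:
        delta = sum(logp_new[start:end]) - sum(logp_old[start:end])
        ratios.append(math.exp(delta / (end - start)))
    return ratios


def clipped_objective(ratios: List[float], advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate, averaged over sub-sequences of one trajectory."""
    terms = []
    for r in ratios:
        r_clipped = max(1.0 - clip, min(1.0 + clip, r))
        terms.append(min(r * advantage, r_clipped * advantage))
    return sum(terms) / len(terms)


if __name__ == "__main__":
    tokens = ["<think>", "plan", "</think>", "<action>", "search", "</action>",
              "<think>", "check", "</think>", "<action>", "answer", "</action>"]
    spans = split_think_action(tokens)
    logp_old = [-1.0] * len(tokens)
    logp_new = [-0.9] * len(tokens)
    advantages = group_advantages([1.0, 0.0])  # two trajectories in the group
    ratios = subsequence_ratios(logp_new, logp_old, spans)
    print(spans, ratios, clipped_objective(ratios, advantages[0]))
```

Under these assumptions, the clip is applied once per Think-Action cycle rather than once per token or once per whole trajectory, which is one way to read "aligning gradient updates with the semantic boundaries" of the interaction.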

1 Citation

1 Influential

2 Altmetric

13.0 Score
