2601.11147v1 Jan 16, 2026 cs.AI

Do We Always Need Query-Level Workflows? Rethinking Agentic Workflow Generation for Multi-Agent Systems

Bingbing Xu (Citations: 1,223; h-index: 12)
Huawei Shen (Citations: 51; h-index: 4)
Zixu Wang (Citations: 15; h-index: 2)
Yige Yuan (Citations: 193; h-index: 7)
Xueqi Cheng (Citations: 1,916; h-index: 22)

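Multi-agent systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows at either the task level or the query level, but their relative costs and benefits remain unclear. Through rethinking and empirical analysis, we show that query-level workflow generation is not always necessary, because a small set of top-K task-level workflows already covers as many queries or more. We also find that exhaustive execution-based task-level evaluation incurs enormous token costs and is often unreliable. Inspired by ideas from self-evolution and generative reward modeling, we propose SCALE, a low-cost task-level generation framework. Instead of full validation runs, SCALE evaluates workflows via Self prediction of the optimizer with few shot CALibration for Evaluation. Extensive experiments show that SCALE maintains competitive performance, with an average degradation of only 0.61% relative to the existing approach across multiple datasets, while cutting overall token usage by up to 83%.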

Original Abstract

Multi-Agent Systems (MAS) built on large language models typically solve complex tasks by coordinating multiple agents through workflows. Existing approaches generate workflows at either the task level or the query level, but their relative costs and benefits remain unclear. After rethinking and empirical analyses, we show that query-level workflow generation is not always necessary, since a small set of top-K best task-level workflows together already covers equivalent or even more queries. We further find that exhaustive execution-based task-level evaluation is both extremely token-costly and frequently unreliable. Inspired by the ideas of self-evolution and generative reward modeling, we propose a low-cost task-level generation framework, SCALE, which stands for Self prediction of the optimizer with few shot CALibration for Evaluation instead of full validation execution. Extensive experiments demonstrate that SCALE maintains competitive performance, with an average degradation of just 0.61% compared to the existing approach across multiple datasets, while cutting overall token usage by up to 83%.
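The coverage claim above can be made concrete with a small sketch. This is only an illustration of what "a small set of top-K task-level workflows covers the queries" could mean, not the authors' code: the function name `top_k_coverage`, the boolean `success` matrix, and the toy data are all hypothetical, and ranking workflows by individual accuracy is an assumption about how "best" is chosen.

```python
from typing import List


def top_k_coverage(success: List[List[bool]], k: int) -> float:
    """Fraction of queries solved by at least one of the k most accurate workflows.

    success[w][q] is True if workflow w solved query q (hypothetical input format).
    """
    if not success or k <= 0:
        return 0.0
    n_queries = len(success[0])
    # Rank workflows by how many queries each solves on its own (assumed criterion).
    ranked = sorted(success, key=sum, reverse=True)[:k]
    # A query counts as covered if any of the selected workflows solves it.
    covered = sum(any(row[q] for row in ranked) for q in range(n_queries))
    return covered / n_queries


# Toy data (made up for illustration): 3 task-level workflows x 5 queries.
toy = [
    [True, True, True, False, False],   # workflow A solves 3 of 5
    [False, False, True, True, False],  # workflow B solves 2 of 5
    [True, False, False, False, True],  # workflow C solves 2 of 5
]
```

In this toy setting the single best workflow covers 3 of 5 queries, while the top three together cover all 5, which is the kind of gap that would make per-query workflow generation unnecessary.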
