2601.22607v1 Jan 30, 2026 cs.AI

From Self-Evolving Synthetic Data to Verifiable-Reward RL: Post-Training Multi-turn Interactive Tool-Using Agents

Hanrui Wang (Citations: 44, h-index: 3)
Di Jin (Citations: 24, h-index: 2)
Chuyi He (Citations: 376, h-index: 3)
Jiaxuan Gao (Citations: 1,006, h-index: 10)
Yi Wu (Citations: 809, h-index: 9)
Shusheng Xu, IIIS, Tsinghua University (Citations: 898, h-index: 10)
Jiaao Chen (Citations: 839, h-index: 17)
Wei-Chen Wang (Citations: 203, h-index: 3)

Abstract

Interactive tool-using agents must solve real-world tasks through multi-turn interaction with both humans and external environments, which requires dialogue state tracking, multi-step tool execution, and adherence to complex instructions. Post-training such agents is challenging for two reasons: synthesizing high-quality multi-turn tool-use data is difficult to scale, and reinforcement learning (RL) can receive noisy signals from user simulation, which degrades training efficiency. We propose a unified framework that combines a self-evolving data agent with verifier-based RL. Our system, EigenData, is a hierarchical multi-agent engine that synthesizes tool-grounded dialogues together with executable per-instance checkers, and improves generation reliability via a closed-loop self-evolving process that updates prompts and workflows. Building on the synthetic data, we develop an RL recipe that first fine-tunes the user model and then applies GRPO-style training with trajectory-level group-relative advantages and dynamic filtering, yielding consistent improvements beyond SFT. Evaluated on tau^2-bench, our best model reaches 73.0% pass^1 on Airline and 98.3% pass^1 on Telecom, matching or exceeding frontier models. Overall, our results suggest a scalable pathway for bootstrapping complex tool-using behaviors without expensive human annotation.
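The abstract names trajectory-level group-relative advantages and dynamic filtering but gives no formulas, so the following is only a minimal sketch of one common reading, not the authors' recipe. It assumes each task is rolled out several times, that a per-instance checker assigns every full trajectory a scalar reward (1.0 for pass, 0.0 for fail in the example), and that "dynamic filtering" means dropping groups whose rewards are all identical, since those carry no learning signal. The function names, the `eps` parameter, and the task identifiers are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one group's trajectory rewards by the group mean and std.

    Every trajectory in the group gets a single advantage value
    (trajectory level), computed relative to its sibling rollouts.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def filter_and_score(groups, eps=1e-6):
    """Assumed form of dynamic filtering: skip zero-variance groups.

    `groups` maps a task id to the list of rewards of its rollouts.
    Groups where every rollout passed or every rollout failed are dropped.
    """
    kept = {}
    for task_id, rewards in groups.items():
        if max(rewards) == min(rewards):
            continue  # all rollouts agree, so every advantage would be zero
        kept[task_id] = group_relative_advantages(rewards, eps)
    return kept


# Hypothetical checker outcomes for three tasks, four rollouts each.
example_groups = {
    "task_a": [1.0, 1.0, 1.0, 1.0],  # always solved: filtered out
    "task_b": [1.0, 0.0, 0.0, 1.0],  # mixed outcomes: kept
    "task_c": [0.0, 0.0, 0.0, 0.0],  # never solved: filtered out
}
advantages = filter_and_score(example_groups)
```

In this sketch only `task_b` survives filtering; its passing rollouts receive positive advantages and its failing rollouts receive negative ones of equal magnitude.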

3 Citations

1 Influential

8.5 Altmetric

47.5 Score
