2603.07853v1 Mar 09, 2026 cs.AI

SynPlanResearch-R1: Encouraging Tool Exploration for Deep Research with Synthetic Plans

Fengran Mo

Citations: 39

h-index: 2

Hansi Zeng

Citations: 10

h-index: 2

Z. Li

Citations: 35

h-index: 2

Yifan Gao

Citations: 294

h-index: 10

Chenwei Zhang

Citations: 162

h-index: 7

Xiaoman Pan

Citations: 26

h-index: 2

Tao Yang

Citations: 11

h-index: 2

Jiacheng Lin

Citations: 43

h-index: 3

Xian Li

Citations: 84

h-index: 4

Jingbo Shang

Citations: 60

h-index: 3

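Research agents use tools to gather information from the web in order to answer user questions, and must dynamically interleave internal reasoning with tool use in the process. Although this capability can in principle be learned through reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors such as premature termination and biased tool usage. As a result, RLVR alone yields only limited performance gains. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories encouraging deeper exploration, uses them to shape exploration during cold-start supervised fine-tuning, and thereby provides a strong initialization for subsequent reinforcement learning. Across seven multi-hop and open-web benchmarks, the proposed framework improves performance over SOTA baselines by up to 6.0% on Qwen3-8B and up to 5.8% on Qwen3-4B. Further analyses of tool-use patterns and training dynamics, compared against baselines, shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.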

Original Abstract

Research agents enable models to gather information from the web using tools to answer user queries, requiring them to dynamically interleave internal reasoning with tool use. While such capabilities can in principle be learned via reinforcement learning with verifiable rewards (RLVR), we observe that agents often exhibit poor exploration behaviors, including premature termination and biased tool usage. As a result, RLVR alone yields limited improvements. We propose SynPlanResearch-R1, a framework that synthesizes tool-use trajectories that encourage deeper exploration to shape exploration during cold-start supervised fine-tuning, providing a strong initialization for subsequent RL. Across seven multi-hop and open-web benchmarks, SynPlanResearch-R1 improves performance by up to 6.0% on Qwen3-8B and 5.8% on Qwen3-4B backbones, respectively, compared to SOTA baselines. Further analyses of tool-use patterns and training dynamics compared to baselines shed light on the factors underlying these gains. Our code is publicly available at https://github.com/HansiZeng/syn-plan-research.
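
To make the two exploration failure modes named in the abstract (premature termination and biased tool usage) more concrete, the following minimal Python sketch computes simple diagnostics over a batch of tool-use trajectories. It is only an illustration: the trajectory format (a list of tool names per episode), the tool names `search` and `browse`, the function name `exploration_stats`, and the `min_calls` threshold are all assumptions made here, not details taken from the paper or its repository.

```python
from collections import Counter


def exploration_stats(trajectories, min_calls=2):
    """Summarize tool-use behavior over a batch of trajectories.

    trajectories: list of lists of tool names, one inner list per episode
                  (hypothetical format, assumed for illustration).
    min_calls:    episodes with fewer tool calls than this are counted as
                  terminating early (arbitrary illustrative threshold).
    """
    if not trajectories:
        raise ValueError("need at least one trajectory")

    # Fraction of episodes that stop after too few tool calls.
    early = sum(1 for t in trajectories if len(t) < min_calls)
    early_rate = early / len(trajectories)

    # Share of all tool calls going to each tool; a heavily skewed
    # distribution is one simple signal of biased tool usage.
    counts = Counter(tool for t in trajectories for tool in t)
    total = sum(counts.values())
    shares = {tool: n / total for tool, n in counts.items()} if total else {}

    mean_calls = total / len(trajectories)
    return {
        "early_rate": early_rate,
        "mean_calls": mean_calls,
        "tool_shares": shares,
    }


# Toy batch; the tool names "search" and "browse" are placeholders.
batch = [["search"], ["search", "search", "browse"], [], ["search", "browse"]]
stats = exploration_stats(batch)
```

Tracking statistics of this kind before and after a training stage is one plausible way to check whether a policy explores more deeply; the analyses actually reported in the paper may be defined differently.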

3 Citations

0 Influential

33.047189562171 Altmetric

168.2 Score
