2601.15625v1 Jan 22, 2026 cs.LG

Robust Tool Use via Fission-GRPO: Learning to Recover from Execution Errors

Rui Wang

Shaosheng Cao

Zhiwei Zhang

Fei Zhao

Zezhong Wang

Bin Liang

Jiakang Wang

Yao Hu

Kam-Fai Wong

Abstract

Large language models (LLMs) can call tools effectively, yet they remain brittle in multi-turn execution: after a tool call error, smaller models often degenerate into repetitive invalid re-invocations, failing to interpret the error feedback and self-correct. This brittleness hinders reliable real-world deployment, where execution errors are inevitable during tool interaction. We identify a key limitation of current approaches: standard reinforcement learning (RL) treats errors as sparse negative rewards and provides no guidance on how to recover, while pre-collected synthetic error-correction datasets suffer from a distribution mismatch with the model's on-policy error modes. To bridge this gap, we propose Fission-GRPO, a framework that converts execution errors into corrective supervision within the RL training loop. Our core mechanism fissions each failed trajectory into a new training instance by augmenting it with diagnostic feedback from a finetuned Error Simulator, then resamples recovery rollouts on-policy. This enables the model to learn from the precise errors it makes during exploration, rather than from static, pre-collected error cases. On BFCL v4 Multi-Turn, Fission-GRPO improves the error recovery rate of Qwen3-8B by 5.7 percentage points, crucially yielding a 4-point overall accuracy gain (42.75% to 46.75%) over GRPO and outperforming specialized tool-use agents.
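
The fission mechanism described in the abstract can be illustrated with a small, self-contained sketch. This is not the authors' implementation: the names `fission_step`, `group_advantages`, and the toy policy, reward, and error simulator are all hypothetical, the "reward at or below zero means failure" rule is an assumption, and the group-relative advantage is the usual GRPO-style normalization rather than anything the abstract specifies.

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str
    response: str
    reward: float


def group_advantages(rewards: List[float]) -> List[float]:
    """GRPO-style advantages: normalize each reward within its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every rollout scored the same, so the group carries no signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]


def sample_group(prompt: str, policy: Callable[[str], str],
                 reward_fn: Callable[[str], float], size: int) -> List[Rollout]:
    """Draw `size` on-policy responses for one prompt and score them."""
    responses = [policy(prompt) for _ in range(size)]
    return [Rollout(prompt, r, reward_fn(r)) for r in responses]


def fission_step(prompt: str, policy: Callable[[str], str],
                 reward_fn: Callable[[str], float],
                 error_simulator: Callable[[str], str],
                 group_size: int = 4) -> List[List[Rollout]]:
    """Return the original group plus one recovery group per failed rollout."""
    original = sample_group(prompt, policy, reward_fn, group_size)
    groups = [original]
    for rollout in original:
        if rollout.reward > 0:  # assumption: reward <= 0 marks a failure
            continue
        # Fission: the failed attempt plus diagnostic feedback becomes a new prompt.
        feedback = error_simulator(rollout.response)
        recovery_prompt = f"{prompt}\n{rollout.response}\n{feedback}"
        # Recovery rollouts are resampled from the current policy (on-policy).
        groups.append(sample_group(recovery_prompt, policy, reward_fn, group_size))
    return groups


# Toy stand-ins (hypothetical) so the sketch runs end to end.
def toy_policy(prompt: str) -> str:
    if "ERROR" in prompt:
        return "get_weather(city='Paris')"
    return "get_weather()"


def toy_reward(response: str) -> float:
    return 1.0 if "city=" in response else 0.0


def toy_error_simulator(response: str) -> str:
    return "ERROR: missing required argument 'city'"


groups = fission_step("What is the weather in Paris?", toy_policy,
                      toy_reward, toy_error_simulator, group_size=2)
for group in groups:
    rewards = [r.reward for r in group]
    print(rewards, group_advantages(rewards))
```

In this toy run both initial rollouts fail, so each one is fissioned into its own recovery group whose prompt carries the simulated error message. In a real training loop the advantages from both the original and the recovery groups would feed the policy-gradient update.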

