2601.10712v1 Jan 15, 2026 cs.CL

MatchTIR: Fine-Grained Supervision for Tool-Integrated Reasoning via Bipartite Matching

Authors:

Shuaiqiang Wang (citations: 2,668; h-index: 21)
Dawei Yin (citations: 1,554; h-index: 19)
Hengyi Cai (citations: 531; h-index: 10)
Changle Qu (citations: 395; h-index: 5)
Sunhao Dai (citations: 1,451; h-index: 15)
Jun Xu (citations: 452; h-index: 7)

Abstract

Tool-Integrated Reasoning (TIR) empowers large language models (LLMs) to tackle complex tasks by interleaving reasoning steps with external tool interactions. However, existing reinforcement learning methods typically rely on outcome- or trajectory-level rewards, assigning the same advantage to every step within a trajectory. This coarse-grained credit assignment fails to distinguish effective tool calls from redundant or erroneous ones, particularly in long-horizon multi-turn scenarios. To address this, we propose MatchTIR, a framework that introduces fine-grained supervision via bipartite matching-based turn-level reward assignment and dual-level advantage estimation. Specifically, we formulate credit assignment as a bipartite matching problem between predicted and ground-truth traces, utilizing two assignment strategies to derive dense turn-level rewards. Furthermore, to balance local step precision with global task success, we introduce a dual-level advantage estimation scheme that integrates turn-level and trajectory-level signals, assigning distinct advantage values to individual interaction turns. Extensive experiments on three benchmarks demonstrate the superiority of MatchTIR. Notably, our 4B model surpasses the majority of 8B competitors, particularly in long-horizon and multi-turn tasks. Our code is available at https://github.com/quchangle1/MatchTIR.
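
The abstract does not spell out the similarity function, the two assignment strategies, or how the turn-level and trajectory-level signals are combined, so the following is only a minimal sketch of the general idea and not the authors' implementation. It assumes a hypothetical `call_similarity` score, a one-to-one maximum-weight matching found by exhaustive search (suitable only for short traces), and a hypothetical additive blend of a group-mean-baselined outcome advantage with a mean-centered turn advantage. All function names and the `weight` parameter are illustrative.

```python
from itertools import permutations


def call_similarity(pred, gold):
    """Hypothetical similarity in [0, 1] between two (tool_name, args) calls."""
    pred_name, pred_args = pred
    gold_name, gold_args = gold
    if pred_name != gold_name:
        return 0.0
    if not gold_args:
        return 1.0
    hits = sum(1 for k, v in gold_args.items() if pred_args.get(k) == v)
    # Assumption: half credit for the right tool, the rest for matching arguments.
    return 0.5 + 0.5 * hits / len(gold_args)


def match_turn_rewards(pred_calls, gold_calls):
    """Assign one reward per predicted turn via one-to-one bipartite matching.

    Exhaustive search over assignments; unmatched predicted turns get 0.
    """
    n_pred, n_gold = len(pred_calls), len(gold_calls)
    rewards = [0.0] * n_pred
    if n_pred == 0 or n_gold == 0:
        return rewards
    sim = [[call_similarity(p, g) for g in gold_calls] for p in pred_calls]

    if n_pred <= n_gold:
        # Each predicted call picks a distinct ground-truth call.
        candidates = (list(enumerate(cols))
                      for cols in permutations(range(n_gold), n_pred))
    else:
        # Each ground-truth call picks a distinct predicted call.
        candidates = ([(r, c) for c, r in enumerate(rows)]
                      for rows in permutations(range(n_pred), n_gold))

    best_total, best_pairs = -1.0, []
    for pairs in candidates:
        total = sum(sim[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs

    for i, j in best_pairs:
        rewards[i] = sim[i][j]
    return rewards


def dual_level_advantages(turn_rewards, outcome_reward, group_outcomes, weight=0.5):
    """Hypothetical blend of trajectory-level and turn-level advantages."""
    if not turn_rewards:
        return []
    baseline = sum(group_outcomes) / len(group_outcomes) if group_outcomes else 0.0
    traj_adv = outcome_reward - baseline
    mean_turn = sum(turn_rewards) / len(turn_rewards)
    return [traj_adv + weight * (r - mean_turn) for r in turn_rewards]


gold = [("search", {"q": "a"}), ("calc", {"expr": "1+1"})]
pred = [("search", {"q": "a"}), ("search", {"q": "a"}), ("calc", {"expr": "2+2"})]

rewards = match_turn_rewards(pred, gold)
print(rewards)  # [1.0, 0.0, 0.5]

advantages = dual_level_advantages(rewards, outcome_reward=1.0, group_outcomes=[1.0, 0.0])
print(advantages)  # [0.75, 0.25, 0.5]
```

In this toy trace the repeated `search` call is left unmatched and receives zero reward, so its advantage is lower than that of the useful turns even though all three turns share the same trajectory outcome. For longer traces, the exhaustive search would need to be replaced by a polynomial-time assignment solver.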

Citations: 1 | Influential: 0 | Altmetric: 45.722612188617 | Score: 229.6
