2604.09455v1 Apr 10, 2026 cs.AI

E3-TIR: Enhanced Experience Exploitation for Tool-Integrated Reasoning

Min Zhang (Citations: 17, h-index: 3)

Weiyang Guo (Citations: 51, h-index: 4)

Zesheng Shi (Citations: 36, h-index: 4)

Liye Zhao (Citations: 4, h-index: 1)

Zeen Zhu (Citations: 7, h-index: 2)

Junxian He (Citations: 353, h-index: 5)

Jing Li (Citations: 7, h-index: 2)

Jiayuan Ma (Citations: 32, h-index: 3)

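Large Language Models (LLMs) have shown considerable potential in Tool-Integrated Reasoning (TIR), but existing training approaches have important limitations. Zero-RL suffers from inefficient exploration and degraded performance because it lacks prior guidance, while SFT-then-RL (supervised fine-tuning followed by reinforcement learning) is held back by high data costs and performance plateaus caused by low entropy. To address these problems, we propose E3-TIR (Enhanced Experience Exploitation), a new paradigm for exploiting experience efficiently in the early stages of agent training. Specifically, we structure training as the dynamic integration of three types of experience: Expert Prefixes, Expert Guided, and Self-Exploration. By exploring diverse branches around expert "anchors" and using a mixed policy optimization mechanism, the method mitigates the distribution shift and optimization conflicts caused by shared prefixes. It dynamically adjusts the model's knowledge boundaries, balancing exploration diversity against training efficiency. Experiments show that E3-TIR improves performance on tool-use tasks by 6% over existing approaches while requiring less than one-tenth of the synthetic data. On an ROI metric that jointly evaluates performance, data cost, and training efficiency, E3-TIR is 1.46 times more efficient than the baselines. Code and related information are available at https://github.com/yuki-younai/E3-TIR.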

Original Abstract

While Large Language Models (LLMs) have demonstrated significant potential in Tool-Integrated Reasoning (TIR), existing training paradigms face major limitations: Zero-RL suffers from inefficient exploration and mode degradation due to a lack of prior guidance, while SFT-then-RL is limited by high data costs and capability plateaus caused by low-entropy collapse. To address these challenges, we propose E3-TIR (Enhanced Experience Exploitation), a warm-up paradigm for the early stages of agent training. Specifically, we formulate training as the dynamic integration of three experience types: Expert Prefixes, Expert Guided, and Self-Exploration. By executing diverse branching exploration around expert "anchors" and employing a mix policy optimization mechanism, we effectively mitigate distribution shifts and resolve optimization conflicts arising from shared prefixes. Our method dynamically adapts the model's knowledge boundaries, effectively balancing exploration diversity with training efficiency. Experimental results demonstrate that E3-TIR achieves a 6% performance improvement over traditional paradigms on tool-use tasks, while requiring less than 10% of the synthetic data. Furthermore, in terms of ROI, a comprehensive metric integrating performance, data cost, and training efficiency, we achieve a 1.46x gain compared to baselines. Code is available at https://github.com/yuki-younai/E3-TIR.
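The abstract does not spell out how branching around expert "anchors" works, so the following is only a minimal sketch of the general idea, under stated assumptions: an expert trajectory is cut at a random step, the expert steps before the cut are kept as a prefix, and the policy generates the rest. All names here (`branch_from_expert`, `policy_step_mask`, `toy_policy`) are hypothetical and do not come from the E3-TIR code base. The mask that separates expert-copied steps from policy-generated steps is one simple way to keep off-policy prefix steps out of a policy-gradient loss; it is an assumption for illustration, not the paper's mix policy optimization mechanism. The "Expert Guided" experience type is not modeled, because the abstract does not define it.

```python
import random
from typing import Callable, List, Tuple

# A trajectory is modeled as a list of step labels (reasoning or tool calls).
Trajectory = List[str]


def branch_from_expert(
    expert: Trajectory,
    continue_fn: Callable[[Trajectory], Trajectory],
    num_branches: int,
    rng: random.Random,
) -> List[Tuple[int, Trajectory]]:
    """Cut the expert trajectory at random anchors and let the policy finish.

    Returns (anchor_index, full_trajectory) pairs. An anchor of 0 keeps no
    expert steps, which corresponds to plain self-exploration.
    """
    rollouts = []
    for _ in range(num_branches):
        # Keep the first k expert steps; k < len(expert) so the policy
        # always has to generate at least the final part itself.
        k = rng.randrange(0, len(expert))
        prefix = expert[:k]
        completion = continue_fn(prefix)  # hypothetical policy rollout
        rollouts.append((k, prefix + completion))
    return rollouts


def policy_step_mask(prefix_len: int, total_len: int) -> List[int]:
    """Return 0 for steps copied from the expert, 1 for policy-generated steps."""
    return [0] * prefix_len + [1] * (total_len - prefix_len)


def toy_policy(prefix: Trajectory) -> Trajectory:
    """Stand-in for a real model: emits two placeholder steps."""
    return [f"policy_step_{len(prefix) + i}" for i in range(2)]


expert_demo = ["plan", "call_search", "read_result", "answer"]
for anchor, traj in branch_from_expert(expert_demo, toy_policy, 3, random.Random(0)):
    print(anchor, policy_step_mask(anchor, len(traj)), traj)
```

Because several branches can share the same expert prefix, a real training loop would also need to decide how those shared steps contribute to the update; the abstract attributes that role to its mix policy optimization mechanism, which this sketch does not attempt to reproduce.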

4 Citations

0 Influential

25.9657359028 Altmetric

133.8 Score

