2601.07641v1 Jan 12, 2026 cs.AI

Beyond Static Tools: Test-Time Tool Evolution for Scientific Reasoning

Jiaxuan Lu (Citations: 46, h-index: 4)
Ziyu Kong (Citations: 6, h-index: 1)
Haiyuan Wan (Citations: 74, h-index: 6)
Wenjie Lou (Citations: 45, h-index: 4)
Haoran Sun (Citations: 12, h-index: 2)
Lilong Wang (Citations: 88, h-index: 5)
Yankai Jiang (Citations: 81, h-index: 5)
Xiaosong Wang (Citations: 12, h-index: 2)
Xiao Sun (Citations: 11, h-index: 2)
Dongzhan Zhou (Citations: 45, h-index: 4)
Yemin Wang (Citations: 40, h-index: 3)
Rong Fu (Citations: 45, h-index: 4)
Cheng Yang, Tencent (Citations: 101, h-index: 6)

Abstract

The central challenge of AI for Science is not reasoning alone, but the ability to create computational methods in an open-ended scientific world. Existing LLM-based agents rely on static, pre-defined tool libraries, a paradigm that fundamentally fails in scientific domains where tools are sparse, heterogeneous, and intrinsically incomplete. In this paper, we propose Test-Time Tool Evolution (TTE), a new paradigm that enables agents to synthesize, verify, and evolve executable tools during inference. By transforming tools from fixed resources into problem-driven artifacts, TTE overcomes the rigidity and long-tail limitations of static tool libraries. To facilitate rigorous evaluation, we introduce SciEvo, a benchmark comprising 1,590 scientific reasoning tasks supported by 925 automatically evolved tools. Extensive experiments show that TTE achieves state-of-the-art performance in both accuracy and tool efficiency, while enabling effective cross-domain adaptation of computational tools. The code and benchmark have been released at https://github.com/lujiaxuan0520/Test-Time-Tool-Evol.
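To make the idea of synthesizing, verifying, and evolving tools during inference more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the names `ToolLibrary`, `synthesize`, `verify`, and `solve` are hypothetical, the LLM call is replaced by a hard-coded template, and verification is reduced to a few reference checks. It only illustrates the general loop of reusing a tool when one exists and otherwise creating, checking, and storing a new one.

```python
from typing import Callable, Dict, List, Optional, Tuple

Tool = Callable[[float], float]


class ToolLibrary:
    """Hypothetical in-memory tool store that can grow during inference."""

    def __init__(self) -> None:
        self.tools: Dict[str, Tool] = {}

    def retrieve(self, name: str) -> Optional[Tool]:
        # Return the stored tool, or None if the library has no match.
        return self.tools.get(name)

    def register(self, name: str, tool: Tool) -> None:
        self.tools[name] = tool


def synthesize(name: str) -> str:
    """Stand-in for an LLM call (assumption): returns source code for a tool."""
    templates = {
        "celsius_to_kelvin": "def tool(x):\n    return x + 273.15\n",
    }
    return templates[name]


def verify(tool: Tool, checks: List[Tuple[float, float]]) -> bool:
    """Accept a candidate only if it reproduces every reference (input, output) pair."""
    return all(abs(tool(x) - expected) < 1e-9 for x, expected in checks)


def solve(library: ToolLibrary, name: str, arg: float,
          checks: List[Tuple[float, float]]) -> float:
    tool = library.retrieve(name)
    if tool is None:
        # No existing tool: synthesize one, verify it, then keep it for reuse.
        namespace: dict = {}
        exec(synthesize(name), namespace)  # only safe here because the source is hard-coded
        candidate = namespace["tool"]
        if not verify(candidate, checks):
            raise ValueError(f"synthesized tool {name!r} failed verification")
        library.register(name, candidate)
        tool = candidate
    return tool(arg)


# Usage: the library starts empty and gains a tool on first use.
library = ToolLibrary()
result = solve(library, "celsius_to_kelvin", 25.0, checks=[(0.0, 273.15)])
```

In a real system the synthesized code would come from a model and would need sandboxed execution; `exec` on untrusted source is unsafe and is used here only because the template is fixed.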

6 Citations

0 Influential

42.033312448852 Altmetric

216.2 Score
