2601.18631v2 Jan 26, 2026 cs.AI

AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning

Mingyang Song (Citations: 248, h-index: 5)
Haoyu Sun (Citations: 23, h-index: 2)
Jiawei Gu, National University of Singapore (Citations: 1,835, h-index: 9)
Linjie Li (Citations: 221, h-index: 5)
Ranjay Krishna (Citations: 896, h-index: 10)
Yu Cheng (Citations: 499, h-index: 6)
Luxin Xu (Citations: 27, h-index: 3)

Original Abstract

When humans face problems beyond their immediate capabilities, they rely on tools, which suggests a promising paradigm for improving visual reasoning in multimodal large language models (MLLMs). Effective reasoning, therefore, hinges on knowing which tools to use, when to invoke them, and how to compose them over multiple steps, even when faced with new tools or new tasks. We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that optimizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage. Together, these components allow models to infer tool utility from task context and intermediate outcomes, enabling coordination of multiple tools and generalization to unseen tools. Empirically, AdaReasoner exhibits strong tool-adaptive and generalization behaviors: it autonomously adopts beneficial tools, suppresses irrelevant ones, and adjusts tool usage frequency based on task demands, despite never being explicitly trained to do so. These capabilities translate into state-of-the-art performance across challenging benchmarks, improving the 7B base model by +24.9% on average and surpassing strong proprietary systems such as GPT-5 on multiple tasks, including VSP and Jigsaw.
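The abstract does not spell out the Tool-GRPO objective. As a rough illustration only, the sketch below shows the group-relative advantage step used in standard GRPO, on the assumption (suggested by the name, not confirmed here) that Tool-GRPO builds on it. The function name, the `eps` constant, and the binary reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own sampling group.

    Standard GRPO-style step: subtract the group mean and divide by the
    group standard deviation. This is an assumption about how an
    end-task reward might be turned into a training signal, not the
    paper's confirmed formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical end-task rewards for four tool-use rollouts sampled for
# the same question: 1.0 = final answer correct, 0.0 = incorrect.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Under this sketch, rollouts whose tool sequences led to a correct final answer receive positive advantages and the others negative ones, so the reward depends only on end-task success rather than on per-tool supervision.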
