2604.14951v1 Apr 16, 2026 cs.CV

RaTA-Tool: A Retrieval-Based Tool Selection Method Using Multimodal Large Language Models

RaTA-Tool: Retrieval-based Tool Selection with Multimodal Large Language Models

Marcella Cornia

Citations: 5,876

h-index: 32

L. Baraldi

Citations: 6,569

h-index: 36

Rita Cucchiara

Citations: 435

h-index: 12

Sara Sarto

Citations: 689

h-index: 11

Evelyn Turri

Citations: 13

h-index: 2

Gabriel Mattioli

Citations: 35

h-index: 4

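Tool learning with foundation models aims to let AI systems use external resources (APIs, computational utilities, specialized models, and so on) to carry out complex tasks that are difficult to solve with standalone language generation. Recent advances in large language models (LLMs) and multimodal large language models (MLLMs) have improved reasoning and perception, but existing tool-use methods are largely confined to text-based inputs and closed-world settings. As a result, these methods struggle to interpret multimodal user instructions and generalize poorly to tools not seen during training. This work introduces RaTA-Tool, a new framework for open-world multimodal tool selection. Instead of learning a direct mapping between user queries and specific tool identifiers, RaTA-Tool has an MLLM convert the multimodal query into a structured task description and then retrieve the most suitable tool by comparing that representation against semantically rich, machine-readable tool descriptions. This retrieval-based approach allows new tools to be added without retraining. To further improve the alignment between task descriptions and tool selection, the method includes a preference-based optimization stage using Direct Preference Optimization (DPO). The work also introduces the first dataset for open-world multimodal tool use, which contains standardized tool descriptions derived from Hugging Face model cards. Extensive experiments show that RaTA-Tool substantially improves tool-selection performance, particularly in open-world multimodal scenarios.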

Original Abstract

Tool learning with foundation models aims to endow AI systems with the ability to invoke external resources -- such as APIs, computational utilities, and specialized models -- to solve complex tasks beyond the reach of standalone language generation. While recent advances in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have expanded their reasoning and perception capabilities, existing tool-use methods are predominantly limited to text-only inputs and closed-world settings. Consequently, they struggle to interpret multimodal user instructions and cannot generalize to tools unseen during training. In this work, we introduce RaTA-Tool, a novel framework for open-world multimodal tool selection. Rather than learning direct mappings from user queries to fixed tool identifiers, our approach enables an MLLM to convert a multimodal query into a structured task description and subsequently retrieve the most appropriate tool by matching this representation against semantically rich, machine-readable tool descriptions. This retrieval-based formulation naturally supports extensibility to new tools without retraining. To further improve alignment between task descriptions and tool selection, we incorporate a preference-based optimization stage using Direct Preference Optimization (DPO). To support research in this setting, we also introduce the first dataset for open-world multimodal tool use, featuring standardized tool descriptions derived from Hugging Face model cards. Extensive experiments demonstrate that our approach significantly improves tool-selection performance, particularly in open-world, multimodal scenarios.
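The retrieval step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the MLLM that turns a multimodal query into a task description is omitted (the task description is supplied as a plain string), and a simple bag-of-words cosine similarity stands in for whatever matching model the paper actually uses. The tool names, their descriptions, and the helper functions `tokenize`, `cosine`, and `select_tool` are all hypothetical and exist only for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_tool(task_description, tool_descriptions):
    """Return the best-matching tool name and the score of every tool.

    Assumption: bag-of-words cosine similarity is a stand-in for the
    paper's matching step, which is not specified in the abstract.
    """
    query = Counter(tokenize(task_description))
    scores = {
        name: cosine(query, Counter(tokenize(description)))
        for name, description in tool_descriptions.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical tool catalogue; in the paper, descriptions are derived
# from Hugging Face model cards.
TOOLS = {
    "image-captioner": "generate a text caption describing the content of an input image",
    "speech-recognizer": "transcribe spoken audio into written text",
    "object-detector": "detect and localize objects in an input image with bounding boxes",
}

# Stand-in for the structured task description an MLLM would produce.
task = "localize objects in the image and return bounding boxes"
best, scores = select_tool(task, TOOLS)
print(best, scores)

# Open-world extension: a new tool is added by registering its
# description; nothing is retrained.
TOOLS["depth-estimator"] = "estimate a depth map from an input image"
new_best, _ = select_tool("estimate depth map for this image", TOOLS)
print(new_best)
```

Because selection depends only on the text of the tool descriptions, adding the hypothetical `depth-estimator` entry makes it selectable immediately, which is the extensibility property the abstract attributes to the retrieval-based formulation.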

0 Citations

0 Influential

18 Altmetric

90.0 Score
