2603.08262v1 Mar 09, 2026 cs.AI

FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use

Jiaxuan Lu (Citations: 46, h-index: 4)
Xiao Sun (Citations: 11, h-index: 2)
Yemin Wang (Citations: 40, h-index: 3)
Hongwei Zeng (Citations: 162, h-index: 7)
Kong Wang (Citations: 3, h-index: 1)
Qingmei Tang (Citations: 26, h-index: 3)
Xiang Chen (Citations: 24, h-index: 2)
Jiahao Pi (Citations: 157, h-index: 5)
Shujian Deng (Citations: 3, h-index: 1)
Lingzhi Chen (Citations: 20, h-index: 3)
Yize Fu (Citations: 3, h-index: 1)
Ke Yang (Citations: 3, h-index: 1)

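The integration of large language models (LLMs) into finance is driving a paradigm shift away from simple information retrieval toward dynamic, agentic interaction. While many benchmarks have emerged for general-purpose tool learning, the financial sector, characterized by high stakes, strict regulatory compliance, and rapid data volatility, remains seriously overlooked. Existing financial evaluations focus mainly on static text analysis or document-based question answering and overlook the complexity of real tool execution. General tool benchmarks, in turn, often lack the domain expertise that finance requires, and frequently rely on simplified environments or a limited number of financial APIs. To close this gap, we introduce FinToolBench, the first real-world benchmark for evaluating financial tool learning agents. Whereas prior work focused on a limited number of simulated tools, FinToolBench builds a realistic environment by combining 760 executable financial tools with 295 rigorous queries that require tool use. We propose a new evaluation framework that goes beyond simple execution success to assess aspects that matter in finance: timeliness, intent type, and regulatory compliance. We also present FATR, a finance-specific baseline for tool retrieval and reasoning that strengthens stability and regulatory compliance. By providing the first test environment for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be released to facilitate future research.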

Original Abstract

The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.
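The abstract does not spell out how the three finance-critical dimensions are scored, so the following is only a minimal sketch of what "going beyond binary execution success" could look like. Everything in it is an assumption made for illustration: the `ToolLabel` and `ToolCall` types, the label values such as "realtime" or "equities", and the per-call match-rate scoring are hypothetical and are not taken from FinToolBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolLabel:
    """Hypothetical finance attributes attached to a tool or required by a query."""
    timeliness: str  # assumed values, e.g. "realtime" or "historical"
    intent: str      # assumed values, e.g. "informational" or "transactional"
    domain: str      # assumed values, e.g. "equities" or "crypto"


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation in an agent trace (hypothetical structure)."""
    tool: ToolLabel
    succeeded: bool


def score_trace(required: ToolLabel, calls: list[ToolCall]) -> dict[str, float]:
    """Return per-dimension match rates over all calls in a trace.

    'execution' is the plain success rate; the other three report how often
    the called tool's label agrees with what the query requires.
    """
    dims = ("timeliness", "intent", "domain")
    if not calls:
        # An empty trace scores zero on every dimension.
        return {"execution": 0.0, **{d: 0.0 for d in dims}}
    n = len(calls)
    scores = {"execution": sum(c.succeeded for c in calls) / n}
    for d in dims:
        scores[d] = sum(getattr(c.tool, d) == getattr(required, d) for c in calls) / n
    return scores


# Usage: a query that needs real-time, informational equities data.
required = ToolLabel("realtime", "informational", "equities")
trace = [
    ToolCall(ToolLabel("realtime", "informational", "equities"), succeeded=True),
    ToolCall(ToolLabel("historical", "informational", "equities"), succeeded=False),
]
print(score_trace(required, trace))
```

Reporting the dimensions separately, rather than folding them into one number, keeps a trace that executes cleanly but pulls stale data distinguishable from one that fails outright.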

3 Citations

1 Influential

3.5 Altmetric

22.5 Score
