2603.08704v1 Mar 09, 2026 cs.AI

Evaluating Financial Intelligence in Large Language Models: Benchmarking SuperInvesting AI with LLM Engines

Vaibhav Singh (Citations: 31, h-index: 2)
Kanha Singhania (Citations: 1, h-index: 1)
Tushar Banga (Citations: 1, h-index: 1)
Parth Arora (Citations: 14, h-index: 1)
Anshul Verma (Citations: 1, h-index: 1)
Agyapal Digra (Citations: 1, h-index: 1)
Jayant Singh Bisht (Citations: 1, h-index: 1)
Varun Singla (Citations: 60, h-index: 3)
S. Garg (Citations: 1, h-index: 1)
A. Gulati (Citations: 270, h-index: 5)
Danish Sharma (Citations: 1, h-index: 1)

Abstract

Large language models are increasingly used for financial analysis and investment research, yet systematic evaluation of their financial reasoning capabilities remains limited. In this work, we introduce the AI Financial Intelligence Benchmark (AFIB), a multi-dimensional evaluation framework that assesses financial analysis capabilities across five dimensions: factual accuracy, analytical completeness, data recency, model consistency, and failure patterns. We evaluate five AI systems (GPT, Gemini, Perplexity, Claude, and SuperInvesting) using a dataset of 95+ structured financial analysis questions derived from real-world equity research tasks. The results reveal substantial differences in performance across models. Within this benchmark setting, SuperInvesting achieves the highest aggregate performance, with an average factual accuracy score of 8.96/10 and the highest completeness score of 56.65/70, while also demonstrating the lowest hallucination rate among the evaluated systems. Retrieval-oriented systems such as Perplexity perform strongly on data recency tasks due to live information access but exhibit weaker analytical synthesis and consistency. Overall, the results highlight that financial intelligence in large language models is inherently multi-dimensional, and that systems combining structured financial data access with analytical reasoning capabilities provide the most reliable performance for complex investment research workflows.
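The two headline scores in the abstract are reported on different scales (out of 10 and out of 70), which makes them hard to compare at a glance. The following is a minimal sketch of putting them on a common 0 to 1 scale. It is not the paper's aggregation method, which the abstract does not describe; the `DimensionScore` class and the dimension labels are hypothetical names introduced only for illustration, and only the two SuperInvesting figures quoted in the abstract are used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DimensionScore:
    """One benchmark dimension with its raw score and scale maximum.

    Hypothetical helper for illustration; not part of AFIB itself.
    """
    name: str
    raw: float
    max_score: float

    def normalized(self) -> float:
        # Divide by the scale maximum so every dimension lands in [0, 1].
        if self.max_score <= 0:
            raise ValueError("max_score must be positive")
        return self.raw / self.max_score


# The only two figures reported in the abstract (both for SuperInvesting).
superinvesting = [
    DimensionScore("factual_accuracy", 8.96, 10.0),
    DimensionScore("analytical_completeness", 56.65, 70.0),
]

# Map each dimension name to its normalized score, rounded to 3 decimals.
normalized = {s.name: round(s.normalized(), 3) for s in superinvesting}
print(normalized)
```

On this common scale the factual accuracy score corresponds to about 0.896 and the completeness score to about 0.809. Any weighting of dimensions into a single aggregate would be a further assumption that the abstract does not specify.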

1 Citation

0 Influential

2.5 Altmetric

13.5 Score
