2602.11304v1 Feb 11, 2026 cs.CR

CryptoAnalystBench: Failures in Multi-Tool Long-Form LLM Analysis

A. Eswaran (Citations: 5, h-index: 1)

Oleg Golev (Citations: 1, h-index: 1)

Darshan Tank (Citations: 0, h-index: 0)

Himanshu Tyagi (Citations: 1, h-index: 1)

S. Rahi (Citations: 101, h-index: 5)

Original Abstract

Modern analyst agents must reason over complex, high-token inputs, including dozens of retrieved documents, tool outputs, and time-sensitive data. While prior work has produced tool-calling benchmarks and examined factuality in knowledge-augmented systems, relatively little work studies their intersection: settings where LLMs must integrate large volumes of dynamic, structured and unstructured multi-tool outputs. We investigate LLM failure modes in this regime using crypto as a representative high-data-density domain. We introduce (1) CryptoAnalystBench, an analyst-aligned benchmark of 198 production crypto and DeFi queries spanning 11 categories; (2) an agentic harness equipped with relevant crypto and DeFi tools to generate responses across multiple frontier LLMs; and (3) an evaluation pipeline with citation verification and an LLM-as-a-judge rubric spanning four user-defined success dimensions: relevance, temporal relevance, depth, and data consistency. Using human annotation, we develop a taxonomy of seven higher-order error types that are not reliably captured by factuality checks or LLM-based quality scoring. We find that these failures persist even in state-of-the-art systems and can compromise high-stakes decisions. Based on this taxonomy, we refine the judge rubric to better capture these errors. While the judge does not align with human annotators on precise scoring across rubric iterations, it reliably identifies critical failure modes, enabling scalable feedback for developers and researchers studying analyst-style agents. We release CryptoAnalystBench with annotated queries, the evaluation pipeline, judge rubrics, and the error taxonomy, and outline mitigation strategies and open challenges in evaluating long-form, multi-tool-augmented systems.
