2605.28359v1 May 27, 2026 cs.AI

From Knowing to Doing: A Memory-Controlled Benchmark for LLM Trading Agents on Stock Markets

Rui Sun
Jing Li
Daxin Jiang
Jiacheng Lu
Taojie Zhu
Beidi Luan
Yonghong He
Zuo Bai
Wentao Zhao
Sin-Chong Wang

Evaluating whether large language model (LLM) agents can profit in capital markets is increasingly framed as end-to-end trading: place an agent in a historical market, let it trade, and measure portfolio returns. This setup is vulnerable to two evaluation failures. First, long backtests often overlap with the knowledge cutoffs of frontier LLMs, allowing memorized tickers, dates, prices, and market narratives to substitute for investment reasoning. Second, raw returns are a noisy proxy for stock-selection ability, since positive performance may come from market beta, style exposure, or favorable regimes rather than genuine alpha. We introduce KTD-Fin (Knowing-To-Doing Financial Benchmark), an end-to-end stock-market trading benchmark that addresses both issues. KTD-Fin uses a data-side masking protocol to anonymize key identifiers and calendar information consistently across prompts and tools, separating historical market memory from investment decision-making. It also incorporates a Barra-style performance attribution framework that decomposes portfolio returns into market, style, and stock-selection alpha components. Across ten frontier LLM agents evaluated on the Chinese CSI300 over a 2024–2026 window, masking substantially changes agent rationales, pushing them towards anonymized factor-based reasoning. Attribution analysis further shows that LLM agents' cumulative returns under leakage-controlled evaluation are largely explained by passive market and style exposure, with limited evidence of persistent stock-selection alpha. These findings suggest that financial LLM benchmarks should evaluate not only whether an agent makes money, but also whether the source of returns reflects transferable investment skill. We release KTD-Fin as a reproducible template for leakage-controlled and attribution-aware evaluation of LLM trading agents.
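
To make the attribution idea in the abstract concrete, the following is a minimal single-period sketch of a factor-based return decomposition into market, style, and stock-selection components. It is not the KTD-Fin implementation: the function name `attribute_period`, the `Attribution` container, the factor names, and all numbers are hypothetical. It also assumes that factor returns are already given (in a Barra-style model they would typically be estimated by cross-sectional regression), that every stock has unit exposure to the market factor, and that industry factors are omitted.

```python
from dataclasses import dataclass
from typing import Dict, Mapping


@dataclass
class Attribution:
    total: float               # portfolio return for the period
    market: float              # part explained by the market factor
    style: Dict[str, float]    # part explained by each style factor
    selection: float           # residual: stock-selection alpha


def attribute_period(
    weights: Mapping[str, float],
    stock_returns: Mapping[str, float],
    style_exposures: Mapping[str, Mapping[str, float]],
    market_return: float,
    style_returns: Mapping[str, float],
) -> Attribution:
    """Split one period's portfolio return into market, style, and residual.

    Assumption: each stock has exposure 1.0 to the market factor, so the
    portfolio's market exposure is simply the sum of its weights.
    """
    total = sum(w * stock_returns[s] for s, w in weights.items())
    market = sum(weights.values()) * market_return

    style: Dict[str, float] = {}
    for factor, factor_return in style_returns.items():
        # Portfolio exposure is the weight-averaged stock exposure.
        exposure = sum(w * style_exposures[s][factor] for s, w in weights.items())
        style[factor] = exposure * factor_return

    # Whatever the factors do not explain is attributed to stock selection.
    selection = total - market - sum(style.values())
    return Attribution(total, market, style, selection)


# Hypothetical anonymized inputs, for illustration only.
result = attribute_period(
    weights={"STOCK_A": 0.6, "STOCK_B": 0.4},
    stock_returns={"STOCK_A": 0.02, "STOCK_B": -0.01},
    style_exposures={
        "STOCK_A": {"size": 0.5, "momentum": 1.0},
        "STOCK_B": {"size": -0.5, "momentum": 0.0},
    },
    market_return=0.005,
    style_returns={"size": 0.002, "momentum": 0.001},
)
print(result)
```

In this toy example most of the portfolio return comes from the market term, which mirrors the abstract's point that a positive raw return does not by itself indicate stock-selection skill. Cumulative attribution over many periods would need a compounding or linking scheme, which this sketch leaves out.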
