2604.27865v1 Apr 30, 2026 cs.AI

KellyBench: A Benchmark for Long-Horizon Sequential Decision Making

Iliyan Zarov

Citations: 31,166

h-index: 5

Ross Taylor

Citations: 32,338

h-index: 9

Thomas J. Grady

Citations: 105

h-index: 4

Kip Parker

Citations: 41

h-index: 2

Henry Course

Citations: 0

h-index: 0

Chengxi Taylor

Citations: 0

h-index: 0

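Language models are reaching peak performance on benchmarks of procedural tasks with narrow objectives. However, these models are increasingly being used in long-horizon, non-stationary environments with broad goals. This paper introduces KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising long-term bankroll growth. Agents are given detailed historical data, including advanced statistics, starting lineups, and public odds. To succeed, agents must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. In the evaluation, every frontier model tested lost money on average across five seeds. The best-performing model had an average return of -8%, and many models went bankrupt on some seeds. To assess strategy sophistication, each model was graded against a human expert rubric, which showed that most models are unsophisticated relative to human baselines. Claude Opus 4.6 received a rubric score of 26.5%, which means there is very large room for improvement. KellyBench is available as a public API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.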

Original Abstract

Language models are saturating benchmarks for procedural tasks with narrow objectives. But they are increasingly being deployed in long-horizon, non-stationary environments with open-ended goals. In this paper we introduce KellyBench, an environment for evaluating sequential decision-making in sports betting markets. Agents are placed in a sequential simulation of the 2023-24 English Premier League season and tasked with maximising their long-term bankroll growth. They are given detailed historical data, including advanced statistics, lineups, and public odds. To succeed they must build machine learning models, identify edge in public markets, and adapt as the environment changes over time. We find that all frontier models evaluated lose money on average over the course of the season across five seeds. The best-performing model achieves an average return of -8%, with many models experiencing ruin across seeds. To judge strategy sophistication, we use a human expert rubric to grade each model and find their approaches to be unsophisticated compared to human baselines; Claude Opus 4.6 achieves a rubric score of 26.5%, which means there is significant room for improvement. KellyBench is available as an open-access API endpoint at https://openreward.ai/GeneralReasoning/KellyBench.
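
The abstract does not say which staking rule agents are expected to use, but the benchmark's name points to the Kelly criterion, the standard rule for maximising long-run bankroll growth on a bet with a known edge. The sketch below is an illustration under that assumption, not the benchmark's API or the authors' method. The function names are hypothetical, odds are assumed to be decimal odds, and the bookmaker margin is removed by simple proportional normalisation.

```python
import math


def implied_probabilities(decimal_odds):
    """Turn decimal odds for mutually exclusive outcomes into probabilities.

    The raw values 1/odds sum to more than 1 because of the bookmaker
    margin; dividing by their sum is one simple (assumed) way to remove it.
    """
    raw = [1.0 / o for o in decimal_odds]
    total = sum(raw)
    return [r / total for r in raw]


def kelly_fraction(p_win, decimal_odds):
    """Kelly stake as a fraction of bankroll for a single binary bet.

    Returns 0.0 when the bet has no positive edge.
    """
    b = decimal_odds - 1.0  # net profit per unit staked
    if b <= 0:
        return 0.0
    return max(0.0, (p_win * decimal_odds - 1.0) / b)


def expected_log_growth(p_win, decimal_odds, fraction):
    """Expected log bankroll growth per bet when staking `fraction`."""
    if not 0.0 <= fraction < 1.0:
        raise ValueError("fraction must be in [0, 1)")
    b = decimal_odds - 1.0
    return (p_win * math.log(1.0 + fraction * b)
            + (1.0 - p_win) * math.log(1.0 - fraction))


if __name__ == "__main__":
    # Hypothetical numbers: the model believes 55% at even-money odds (2.0).
    f = kelly_fraction(0.55, 2.0)
    print(f"Kelly stake: {f:.3f} of bankroll")
    print(f"Growth at Kelly stake: {expected_log_growth(0.55, 2.0, f):.5f}")
    print(f"Growth at 3x Kelly:    {expected_log_growth(0.55, 2.0, 3 * f):.5f}")
```

With these made-up inputs, staking three times the Kelly fraction gives negative expected log growth even though the bet itself has a positive edge. This is one way a strategy with a real edge can still lose money or go bankrupt over a season.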
