2603.16453v1 Mar 17, 2026 cs.AI

RetailBench: Evaluating Long-Horizon Autonomous Decision-Making and Strategy Stability of LLM Agents in Realistic Retail Environments

Jun Wang (citations: 79, h-index: 5)

Linghua Zhang (citations: 17, h-index: 2)

Zhisong Zhang (citations: 196, h-index: 3)

Jing Wu (citations: 18, h-index: 2)

Abstract

Large Language Model (LLM)-based agents have achieved notable success on short-horizon and highly structured tasks. However, their ability to maintain coherent decision-making over long horizons in realistic and dynamic environments remains an open challenge. We introduce RetailBench, a high-fidelity benchmark designed to evaluate long-horizon autonomous decision-making in realistic commercial scenarios, where agents must operate under stochastic demand and evolving external conditions. We further propose the Evolving Strategy & Execution framework, which separates high-level strategic reasoning from low-level action execution. This design enables adaptive and interpretable strategy evolution over time. It is particularly important for long-horizon tasks, where non-stationary environments and error accumulation require strategies to be revised at a different temporal scale than action execution. Experiments on eight state-of-the-art LLMs across progressively challenging environments show that our framework improves operational stability and efficiency compared to other baselines. However, performance degrades substantially as task complexity increases, revealing fundamental limitations in current LLMs for long-horizon, multi-factor decision-making.
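
The separation of strategy revision from action execution at different time scales can be illustrated with a small toy loop. The sketch below is not the paper's implementation: the `Strategy` fields, the heuristic in `revise_strategy` (a stand-in for an LLM-driven strategic step), the weekly review interval, and the drifting Gaussian demand model are all assumptions chosen only to show the two-timescale structure.

```python
import random
from dataclasses import dataclass


@dataclass
class Strategy:
    # Hypothetical strategy parameters; not taken from the paper.
    target_stock: int  # order-up-to inventory level
    price: float       # unit selling price


def revise_strategy(strategy, recent_sales, recent_stockouts):
    """Slow loop: a heuristic stand-in for high-level strategic reasoning."""
    avg_sales = sum(recent_sales) / len(recent_sales)
    target = strategy.target_stock
    if recent_stockouts > 0:
        # Demand exceeded stock at least once: raise the target level.
        target = int(target * 1.2) + 1
    elif avg_sales < 0.5 * target:
        # Stock is far above typical sales: lower the target level.
        target = max(1, int(target * 0.9))
    return Strategy(target_stock=target, price=strategy.price)


def execute_day(strategy, inventory):
    """Fast loop: turn the current strategy into a concrete daily order."""
    return max(0, strategy.target_stock - inventory)


def simulate(days=60, review_every=7, seed=0):
    rng = random.Random(seed)
    strategy = Strategy(target_stock=20, price=5.0)
    inventory, cash, unit_cost = 0, 0.0, 3.0
    sales_log, stockouts = [], 0
    history = [strategy]
    for day in range(1, days + 1):
        # Low-level execution happens every day.
        order = execute_day(strategy, inventory)
        inventory += order
        cash -= order * unit_cost
        # Stochastic demand whose mean drifts upward (non-stationary).
        demand = max(0, round(rng.gauss(10 + day / 10, 3)))
        sold = min(demand, inventory)
        if demand > inventory:
            stockouts += 1
        inventory -= sold
        cash += sold * strategy.price
        sales_log.append(sold)
        # High-level strategy is revised on a slower schedule.
        if day % review_every == 0:
            strategy = revise_strategy(
                strategy, sales_log[-review_every:], stockouts
            )
            stockouts = 0
            history.append(strategy)
    return cash, history


if __name__ == "__main__":
    final_cash, strategies = simulate()
    print(f"final cash: {final_cash:.2f}")
    for i, s in enumerate(strategies):
        print(f"revision {i}: target_stock={s.target_stock}")
```

Keeping the list of past strategies makes the evolution inspectable after the run, which is the kind of interpretability the abstract attributes to separating the two levels.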
