2606.05806v1 Jun 04, 2026 cs.AI

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Shuaiqiang Wang

Citations: 2,668

h-index: 21

Lingyong Yan

Baidu Inc.

Citations: 1,503

h-index: 17

Xiang Li

Citations: 19

h-index: 2

Yucheng Shen

Citations: 47

h-index: 3

Dongsheng Zhu

Citations: 94

h-index: 3

Dawei Yin

Citations: 106

h-index: 5

Xucheng Ma

Citations: 0

h-index: 0

Yukun Zhao

Citations: 96

h-index: 3

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. In these scenarios, systemic over-trust in corrupted outputs drives the Perturbation Recovery Rate (PRR) down by around 37%, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault tolerance improves with model scale $3.66\times$ more slowly than basic task execution, which marks dynamic replanning as a distinct bottleneck that neither model scaling nor prompting addresses. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
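The $2 \times 2$ perturbation taxonomy named in the abstract can be illustrated with a small sketch. This is not ToolMaze's implementation: the class names (`Visibility`, `Persistence`, `PerturbedTool`), the `transient_failures` parameter, the toy `double` tool, and the choice to corrupt outputs by adding 1 are all assumptions made here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # the failure surfaces as an error the agent can see
    IMPLICIT = "implicit"  # the tool returns a plausible but wrong value


class Persistence(Enum):
    TRANSIENT = "transient"  # the fault clears after a few calls
    PERMANENT = "permanent"  # the fault is present on every call


@dataclass
class PerturbedTool:
    """Hypothetical wrapper that injects one of the four perturbation types."""

    tool: Callable[[int], int]
    visibility: Visibility
    persistence: Persistence
    transient_failures: int = 1  # assumed: number of calls a transient fault lasts
    calls: int = 0

    def __call__(self, x: int) -> int:
        self.calls += 1
        faulty = (
            self.persistence is Persistence.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not faulty:
            return self.tool(x)
        if self.visibility is Visibility.EXPLICIT:
            raise RuntimeError("tool unavailable")
        # Implicit fault: no error is raised, the output is silently corrupted.
        return self.tool(x) + 1


def double(x: int) -> int:
    """Toy tool used only for this illustration."""
    return 2 * x
```

Under these assumptions, a simple retry is enough to recover from an explicit transient fault, a permanent fault requires routing around the tool, and an implicit fault gives the agent nothing to react to unless it checks the returned value, which is consistent with the abstract's observation that implicit failures cause the sharpest drops.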

