2606.05806v1 Jun 04, 2026 cs.AI

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

Shuaiqiang Wang
Lingyong Yan (Baidu Inc.)
Xiang Li
Yucheng Shen
Dongsheng Zhu
Dawei Yin
Xucheng Ma
Yukun Zhao

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.
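To make the $2 \times 2$ perturbation taxonomy concrete, the following is a minimal Python sketch of how a tool call might be wrapped with one of the four perturbation types. It is not taken from the ToolMaze code base. The names `PerturbedTool`, `ToolError`, `Visibility`, `Duration`, `corrupt`, and `transient_failures` are hypothetical, and the semantics are assumptions: explicit failures raise an error, implicit failures silently return a corrupted result, transient failures clear after a fixed number of calls, and permanent failures never clear.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Callable


class Visibility(Enum):
    EXPLICIT = "explicit"  # assumption: the failure surfaces as an error
    IMPLICIT = "implicit"  # assumption: the tool silently returns corrupted output


class Duration(Enum):
    TRANSIENT = "transient"  # assumption: the tool recovers after a few calls
    PERMANENT = "permanent"  # assumption: the tool never recovers


class ToolError(RuntimeError):
    """Hypothetical error type raised for explicit failures."""


@dataclass
class PerturbedTool:
    """Wraps a tool and injects one of the four (visibility, duration) failures."""

    tool: Callable[..., Any]
    visibility: Visibility
    duration: Duration
    # Hypothetical corruption function used for implicit failures.
    corrupt: Callable[[Any], Any] = lambda result: None
    # Number of initial calls that fail when the perturbation is transient.
    transient_failures: int = 1
    calls: int = field(default=0, init=False)

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        self.calls += 1
        failing = (
            self.duration is Duration.PERMANENT
            or self.calls <= self.transient_failures
        )
        if not failing:
            return self.tool(*args, **kwargs)
        if self.visibility is Visibility.EXPLICIT:
            # Explicit failure: the agent sees an error and can react to it.
            raise ToolError(f"tool failed on call {self.calls}")
        # Implicit failure: the call "succeeds" but the output is wrong.
        return self.corrupt(self.tool(*args, **kwargs))
```

Under these assumptions, an explicit transient failure rewards a simple retry, whereas an implicit permanent failure can only be handled by noticing that the output is wrong and routing around the tool, which is the kind of case the abstract reports as hardest.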
