2602.11348v2 Feb 11, 2026 cs.AI

AgentNoiseBench: Evaluating the Robustness of Tool-Using LLM Agents in Noisy Environments

AgentNoiseBench: Benchmarking Robustness of Tool-Using LLM Agents Under Noisy Condition

Xunliang Cai

Citations: 15

h-index: 2

Hui Su

Citations: 79

h-index: 3

An Zhang

Citations: 131

h-index: 6

Xiang Wang

Citations: 715

h-index: 14

Ruipeng Wang

Citations: 100

h-index: 5

Yuxin Chen

Citations: 4

h-index: 1

Yukai Wang

Citations: 38

h-index: 2

Chang Wu

Citations: 84

h-index: 3

Xiaodong Cai

Citations: 44

h-index: 2

Qi Gu

Citations: 24

h-index: 3

Tat-Seng Chua

Citations: 5

h-index: 1

Junfeng Fang

Citations: 571

h-index: 13

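Recent advances in large language models (LLMs) have enabled LLM-based agents to perform strongly across a variety of benchmarks. However, their performance in real-world settings often differs from that observed in benchmark settings, and the difference is especially pronounced in complex and imperfect environments. This difference arises mainly because current training and evaluation paradigms rest on idealized assumptions and overlook the inherent stochasticity and noise of real-world interactions. To close this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agent models in noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and classify environmental noise into two main types: user noise and tool noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks. Using this pipeline, we carry out extensive evaluations of a wide range of models with diverse architectures and parameter scales. Our results show consistent performance changes under different noise conditions and highlight how sensitive current agent models are to realistic environmental changes.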

Original Abstract

Recent advances in large language models have enabled LLM-based agents to achieve strong performance on a variety of benchmarks. However, their performance in real-world deployments often differs from that observed on benchmark settings, especially in complex and imperfect environments. This discrepancy largely arises because prevailing training and evaluation paradigms are typically built on idealized assumptions, overlooking the inherent stochasticity and noise present in real-world interactions. To bridge this gap, we introduce AgentNoiseBench, a framework for systematically evaluating the robustness of agentic models under noisy environments. We first conduct an in-depth analysis of biases and uncertainties in real-world scenarios and categorize environmental noise into two primary types: user-noise and tool-noise. Building on this analysis, we develop an automated pipeline that injects controllable noise into existing agent-centric benchmarks while preserving task solvability. Leveraging this pipeline, we perform extensive evaluations across a wide range of models with diverse architectures and parameter scales. Our results reveal consistent performance variations under different noise conditions, highlighting the sensitivity of current agentic models to realistic environmental perturbations.

1 Citations

0 Influential

7 Altmetric

36.0 Score
