2604.06111v1 Apr 07, 2026 cs.AI

ACE-Bench: An Agent-Configurable Evaluation Benchmark with Scalable Horizons and Controllable Difficulty in Lightweight Environments

ACE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

Yulin Zhou

Chuan Ma

Debargha Ganguly

Wang Yang

Shouren Wang

Case Western Reserve University

Chaoda Song

Xinpeng Li

Vipin Chaudhary

Xiaotian Han

Zhihao Dou

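Existing agent benchmarks have two important limitations. First, interaction with the environment incurs high overhead (up to 41% of total evaluation time); second, imbalanced distributions of task horizon and difficulty make aggregate scores unreliable. To address these problems, we propose ACE-Bench. ACE-Bench is built on a unified grid-based planning task in which the agent must fill hidden slots in a partially completed schedule while satisfying both local slot constraints and global constraints. ACE-Bench allows fine-grained control along two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, controlled by a decoy budget $B$ that sets the number of globally misleading decoy candidates. Importantly, under the Lightweight Environment design all tool calls are resolved from static JSON files, which eliminates setup overhead and enables fast, reproducible evaluation suitable for training-time validation. We first verify that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench shows strong domain consistency and model discriminability. We then run comprehensive experiments on 13 models of varying sizes and families across 6 domains, showing substantial performance differences between models and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.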

Original Abstract

Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41% of total evaluation time) and imbalanced task horizon and difficulty distributions that make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
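
To make the task structure concrete, the following is a minimal Python sketch of one toy instance, not the authors' implementation. Everything in it is an assumption made for illustration: the slot names, the candidate fields (`cost`, `open`), the budget-style global constraint, and the helper names `call_tool`, `solve`, and `count_decoys` do not come from the paper. It shows $H = 2$ hidden slots, tool calls answered from static JSON data, and two candidates that pass the local check but can never appear in a globally valid schedule, which is how a decoy is interpreted here.

```python
import itertools
import json

# Hypothetical static environment: every tool call is a lookup into
# pre-built JSON, so no live service has to be set up.
STATIC_ENV = json.loads("""
{
  "candidates": {
    "mon_am": [{"name": "museum",  "cost": 20, "open": true},
               {"name": "gallery", "cost": 90, "open": true},
               {"name": "zoo",     "cost": 10, "open": false}],
    "mon_pm": [{"name": "park",    "cost": 0,  "open": true},
               {"name": "opera",   "cost": 70, "open": true}]
  }
}
""")


def call_tool(slot: str) -> list[dict]:
    """Resolve a tool call for one hidden slot from the static data."""
    return STATIC_ENV["candidates"][slot]


def satisfies_local(candidate: dict) -> bool:
    """Assumed local slot constraint: the venue must be open."""
    return candidate["open"]


def satisfies_global(choice: tuple, budget: int) -> bool:
    """Assumed global constraint: total cost must stay within the budget."""
    return sum(c["cost"] for c in choice) <= budget


def solve(hidden_slots: list[str], budget: int) -> list[dict[str, str]]:
    """Brute-force every locally valid filling and keep the globally valid ones."""
    per_slot = [[c for c in call_tool(s) if satisfies_local(c)] for s in hidden_slots]
    solutions = []
    for choice in itertools.product(*per_slot):
        if satisfies_global(choice, budget):
            solutions.append({s: c["name"] for s, c in zip(hidden_slots, choice)})
    return solutions


def count_decoys(hidden_slots: list[str], budget: int) -> int:
    """Count candidates that pass the local check but appear in no valid schedule."""
    used = {(s, n) for sol in solve(hidden_slots, budget) for s, n in sol.items()}
    return sum(
        1
        for s in hidden_slots
        for c in call_tool(s)
        if satisfies_local(c) and (s, c["name"]) not in used
    )


slots = ["mon_am", "mon_pm"]          # H = 2 hidden slots
print(solve(slots, budget=50))        # the single valid filling
print(count_decoys(slots, budget=50)) # locally valid but globally misleading
```

In this toy instance, adding hidden slots lengthens the horizon, while adding locally valid but globally infeasible candidates raises difficulty without changing the horizon, which mirrors the two axes described in the abstract.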
