2602.14257v1 Feb 15, 2026 cs.CL

AD-Bench: A Real-World, Trajectory-Aware Advertising Analytics Benchmark for LLM Agents

Huan Yu (Citations: 41, h-index: 2)
Ling Hu (Citations: 7, h-index: 2)
Yiding Sun (Citations: 48, h-index: 4)
Tianle Xia (Citations: 28, h-index: 2)
Wenwei Li (Citations: 45, h-index: 4)
Ming Xu (Citations: 215, h-index: 4)
Liqun Liu (Citations: 7, h-index: 2)
Peng Shu (Citations: 4, h-index: 1)
Jie Jiang (Citations: 21, h-index: 2)

Abstract

Large Language Model (LLM) agents have made remarkable progress on complex reasoning tasks, but evaluating their performance in real-world environments has become a critical problem. Current benchmarks are largely restricted to idealized simulations and fail to address the practical demands of specialized domains such as advertising and marketing analytics. In these fields, tasks are inherently more complex and often require multi-round interaction with professional marketing tools. To address this gap, we propose AD-Bench, a benchmark designed around the real-world business requirements of advertising and marketing platforms. AD-Bench is constructed from real user marketing analysis requests, with domain experts providing verifiable reference answers and the corresponding reference tool-call trajectories. The benchmark categorizes requests into three difficulty levels (L1-L3) to evaluate agents' capabilities under multi-round, multi-tool collaboration. Experiments show that on AD-Bench, Gemini-3-Pro achieves Pass@1 = 68.0% and Pass@3 = 83.0%, but its performance drops significantly on L3, to Pass@1 = 49.4% and Pass@3 = 62.1%, with a trajectory coverage of 70.1%. This indicates that even state-of-the-art models still exhibit substantial capability gaps in complex advertising and marketing analysis scenarios. AD-Bench provides a realistic benchmark for evaluating and improving advertising and marketing agents. The leaderboard and code are available at https://github.com/Emanual20/adbench-leaderboard.
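The abstract reports Pass@1, Pass@3, and trajectory coverage without defining them. The sketch below shows one plausible reading of each metric, not the paper's actual scoring code. It assumes that Pass@k counts a task as solved when at least one of the first k attempts is judged correct, and that trajectory coverage is the share of reference tool calls that also appear in the agent's trajectory, ignoring order. The function names, the tool names, and the sample data are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def pass_at_k(results: Sequence[Sequence[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k attempts.

    Assumed definition: results[i][j] is True when attempt j on task i
    matched the reference answer.
    """
    if not results:
        raise ValueError("results must contain at least one task")
    if any(len(attempts) < k for attempts in results):
        raise ValueError("every task needs at least k attempts")
    solved = sum(any(attempts[:k]) for attempts in results)
    return solved / len(results)


def trajectory_coverage(reference: Sequence[str], predicted: Sequence[str]) -> float:
    """Share of reference tool calls that also occur in the predicted trajectory.

    Assumed definition: order-insensitive multiset overlap, normalized by
    the length of the reference trajectory.
    """
    if not reference:
        raise ValueError("reference trajectory must not be empty")
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)
    matched = sum(min(count, pred_counts[tool]) for tool, count in ref_counts.items())
    return matched / len(reference)


# Hypothetical data: four tasks, three attempts each.
attempt_results = [
    [True, False, False],
    [False, False, True],
    [False, False, False],
    [False, True, True],
]
print(pass_at_k(attempt_results, 1))  # 0.25
print(pass_at_k(attempt_results, 3))  # 0.75

# Hypothetical tool names for a single request.
reference_calls = ["query_campaign", "get_metrics", "compare_periods"]
agent_calls = ["query_campaign", "get_metrics", "get_metrics"]
print(trajectory_coverage(reference_calls, agent_calls))
```

Under these assumptions, a gap between Pass@1 and Pass@3 means the agent often needs several attempts to reach a correct answer, while a coverage below 100% means some expert-chosen tool calls never appear in the agent's trajectory.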

2 Citations

0 Influential

22 Altmetric

112.0 Score
