2601.07206v1 Jan 12, 2026 cs.AI

LLMRouterBench: A Massive Benchmark and Unified Framework for LLM Routing

Shengji Tang (Citations: 100, h-index: 5)
Hao Li (Citations: 75, h-index: 5)
Shuyue Hu (Citations: 136, h-index: 7)
Peng Ye (Citations: 63, h-index: 5)
Yiqun Zhang (Citations: 44, h-index: 4)
Zhaoyan Guo (Citations: 5, h-index: 1)
Chenxu Wang (Citations: 83, h-index: 5)
Qiaosheng Zhang (Citations: 40, h-index: 4)
Yang Chen (Citations: 26, h-index: 2)
Lei Bai (Citations: 27, h-index: 3)
Zhen Wang (Citations: 36, h-index: 2)
Biqing Qi (Citations: 5, h-index: 1)

Abstract

Large language model (LLM) routing assigns each query to the most suitable model from an ensemble. We introduce LLMRouterBench, a large-scale benchmark and unified framework for LLM routing. It comprises over 400K instances from 21 datasets and 33 models. Moreover, it provides comprehensive metrics for both performance-oriented routing and performance-cost trade-off routing, and integrates 10 representative routing baselines. Using LLMRouterBench, we systematically re-evaluate the field. While confirming strong model complementarity, the central premise of LLM routing, we find that many routing methods exhibit similar performance under unified evaluation, and several recent approaches, including commercial routers, fail to reliably outperform a simple baseline. Meanwhile, a substantial gap remains to the Oracle, driven primarily by persistent model-recall failures. We further show that backbone embedding models have limited impact, that larger ensembles exhibit diminishing returns compared to careful model curation, and that the benchmark also enables latency-aware analysis. All code and data are available at https://github.com/ynulihao/LLMRouterBench.
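The quantities the abstract compares (a single-model baseline, a router, and the Oracle) can be illustrated with a minimal sketch. It assumes a binary per-query correctness matrix and treats the Oracle as a router that always picks a correct model whenever one exists. The function names and the toy data are hypothetical: they are not taken from the LLMRouterBench code and are not benchmark results.

```python
from typing import Sequence

# correct[q][m] is 1 if model m answers query q correctly, else 0.
Matrix = Sequence[Sequence[int]]


def best_single_accuracy(correct: Matrix) -> float:
    """Accuracy of the single model that is best on average (a simple baseline)."""
    n_queries = len(correct)
    n_models = len(correct[0])
    return max(
        sum(row[m] for row in correct) / n_queries for m in range(n_models)
    )


def oracle_accuracy(correct: Matrix) -> float:
    """Accuracy of an idealized router that succeeds whenever any model does."""
    return sum(1 for row in correct if any(row)) / len(correct)


def router_accuracy(correct: Matrix, choices: Sequence[int]) -> float:
    """Accuracy of a router that sends query q to model choices[q]."""
    return sum(row[c] for row, c in zip(correct, choices)) / len(correct)


# Toy, made-up data: 4 queries, 3 models with complementary strengths.
CORRECT = [
    [1, 0, 0],
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
]
# Hypothetical router decisions, one model index per query.
CHOICES = [0, 1, 1, 0]

if __name__ == "__main__":
    print("best single:", best_single_accuracy(CORRECT))
    print("router:     ", router_accuracy(CORRECT, CHOICES))
    print("oracle:     ", oracle_accuracy(CORRECT))
```

In this toy setup no single model exceeds 0.5 accuracy while the Oracle reaches 1.0, which is what model complementarity looks like in miniature. The hypothetical router lands in between because it sends the last query to a model that fails on it, even though another model in the pool would have answered correctly.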

