2602.11877v1 Feb 12, 2026 cs.CL

Towards Fair and Comprehensive Evaluation of Routers in Collaborative LLM Systems

Yixia Li, Southern University of Science and Technology (citations: 141, h-index: 6)

Bingyi Jing (citations: 31, h-index: 4)

Wanxin Wu (citations: 41, h-index: 3)

He Zhu (citations: 57, h-index: 4)

Jie Zhao (citations: 168, h-index: 3)

Guanhua Chen (citations: 16, h-index: 1)

Jian Yang (citations: 154, h-index: 2)

Lei Yang (citations: 174, h-index: 8)

Benyou Wang (citations: 6, h-index: 2)

Hongru Wang (citations: 379, h-index: 6)

Abstract

Large language models (LLMs) have achieved success, but cost and privacy constraints necessitate deploying smaller models locally while offloading complex queries to cloud-based models. Existing router evaluations are unsystematic, overlooking scenario-specific requirements and out-of-distribution robustness. We propose RouterXBench, a principled evaluation framework with three dimensions: router ability, scenario alignment, and cross-domain robustness. Unlike prior work that relies on output probabilities or external embeddings, we utilize internal hidden states that capture model uncertainty before answer generation. We introduce ProbeDirichlet, a lightweight router that aggregates cross-layer hidden states via learnable Dirichlet distributions with probabilistic training. Trained on multi-domain data, it generalizes robustly across in-domain and out-of-distribution scenarios. Our results show ProbeDirichlet achieves 16.68% and 18.86% relative improvements over the best baselines in router ability and high-accuracy scenarios, with consistent performance across model families, model scales, heterogeneous tasks, and agentic workflows.
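The abstract does not spell out how ProbeDirichlet combines layers, so the following is only a minimal sketch of one plausible reading: per-layer hidden states are pooled with weights drawn from a Dirichlet distribution whose concentration parameters would be learned, and a linear probe turns the pooled vector into a probability that the local model can answer. Every name here (`dirichlet_mean`, `dirichlet_sample`, `aggregate_layers`, `route`, `probe_w`, `probe_b`, `threshold`) is hypothetical, and the choice to sample weights during training and use the Dirichlet mean at inference is an assumption, not something the abstract confirms.

```python
import math
import random


def dirichlet_mean(alpha):
    """Expected layer weights E[w] = alpha / sum(alpha)."""
    total = sum(alpha)
    return [a / total for a in alpha]


def dirichlet_sample(alpha, rng):
    """Draw w ~ Dirichlet(alpha) by normalizing independent Gamma(alpha_i, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [g / total for g in draws]


def aggregate_layers(hidden_states, weights):
    """Weighted sum over layers; hidden_states is a list of per-layer vectors."""
    dim = len(hidden_states[0])
    return [
        sum(w * layer[j] for w, layer in zip(weights, hidden_states))
        for j in range(dim)
    ]


def route(hidden_states, alpha, probe_w, probe_b, threshold=0.5, rng=None):
    """Return (decision, p), where p estimates that the local model suffices.

    Assumption: if rng is given, layer weights are sampled (training-style);
    otherwise the Dirichlet mean is used (inference-style).
    """
    if rng is not None:
        weights = dirichlet_sample(alpha, rng)
    else:
        weights = dirichlet_mean(alpha)
    pooled = aggregate_layers(hidden_states, weights)
    logit = sum(w * x for w, x in zip(probe_w, pooled)) + probe_b
    p = 1.0 / (1.0 + math.exp(-logit))
    decision = "local" if p >= threshold else "cloud"
    return decision, p


# Toy example: 3 layers, 2-dimensional hidden states (made-up numbers).
hidden = [[0.2, -0.1], [0.5, 0.3], [1.0, 0.8]]
alpha = [1.0, 2.0, 5.0]  # would be learned; fixed here for illustration
decision, p = route(hidden, alpha, probe_w=[1.0, -0.5], probe_b=0.0)
print(decision, round(p, 3))
```

In this sketch, raising `threshold` sends more queries to the cloud model, which is one way a deployment could trade cost against accuracy; the paper's actual training objective and routing rule may differ.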
