2603.13426v1 Mar 13, 2026 cs.LG

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Xue Liu (Citations: 43, h-index: 3)
Huamin Chen (Citations: 31, h-index: 3)
Xunzhuo Liu (Citations: 31, h-index: 3)
Junchen Jiang (Citations: 486, h-index: 7)
Bowei He (Citations: 173, h-index: 4)

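In LLM inference gateways, semantic routers select tools on the critical request path, where every added millisecond of latency compounds across millions of requests. This work proposes Outcome-Aware Tool Selection (OATS), which interpolates each tool embedding toward the centroid of the queries on which that tool historically succeeded. The refinement runs offline and adds no parameters, latency, or GPU cost at serving time. On MetaTool (199 tools, 4,287 queries), NDCG@5 improves from 0.869 to 0.940; on ToolBench (2,413 APIs), it improves from 0.834 to 0.848. The authors also evaluate two learned extensions: an MLP re-ranker with 2,625 parameters and a contrastive adapter with 197K parameters. When outcome data is sparse relative to the size of the tool set, the MLP re-ranker either hurts performance or only matches the baseline, whereas the contrastive adapter yields gains comparable to OATS on MetaTool (NDCG@5 of 0.931). All methods are evaluated on the same held-out 30% test split. The practical takeaway is to apply the zero-cost refinement first and add learned components only when data density warrants it. All mechanisms run within a single-digit millisecond CPU budget.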

Original Abstract

Semantic routers in LLM inference gateways select tools in the critical request path, where every millisecond of added latency compounds across millions of requests. We propose Outcome-Aware Tool Selection (OATS), which interpolates tool embeddings toward the centroid of queries where they historically succeed, an offline process that adds no parameters, latency, or GPU cost at serving time. On MetaTool (199 tools, 4,287 queries), this improves NDCG@5 from 0.869 to 0.940; on ToolBench (2,413 APIs), from 0.834 to 0.848. We also evaluate two learned extensions: a 2,625-parameter MLP re-ranker and a 197K-parameter contrastive adapter. The MLP re-ranker hurts or matches baseline when outcome data is sparse relative to the tool set; the contrastive adapter provides comparable gains on MetaTool (NDCG@5: 0.931). All methods are evaluated on the same held-out 30% test split. The practical takeaway is to start with the zero-cost refinement and add learned components only when data density warrants it. All mechanisms run within single-digit millisecond CPU budgets.
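
The abstract describes the refinement only at a high level, so the following is a minimal sketch of the idea, not the authors' implementation. It assumes a simple linear interpolation with a hypothetical weight `alpha`, a hypothetical outcome-record format `(query_embedding, tool_name, succeeded)`, and re-normalization after mixing. The function names `refine_tool_embeddings` and `rank_tools` are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def refine_tool_embeddings(tool_embs, outcomes, alpha=0.5):
    """Offline step: move each tool embedding toward the centroid of the
    queries on which that tool succeeded.

    tool_embs: dict mapping tool name -> embedding (list of floats)
    outcomes:  iterable of (query_embedding, tool_name, succeeded) records
               (assumed format, not taken from the paper)
    alpha:     assumed interpolation weight in [0, 1]; 0 keeps the original
    """
    sums, counts = {}, {}
    for query_emb, tool, succeeded in outcomes:
        if not succeeded or tool not in tool_embs:
            continue  # only successful outcomes shape the centroid
        acc = sums.setdefault(tool, [0.0] * len(query_emb))
        for i, x in enumerate(query_emb):
            acc[i] += x
        counts[tool] = counts.get(tool, 0) + 1

    refined = {}
    for tool, emb in tool_embs.items():
        if tool not in counts:
            refined[tool] = normalize(emb)  # no outcome data: leave as is
            continue
        centroid = [s / counts[tool] for s in sums[tool]]
        mixed = [(1 - alpha) * e + alpha * c for e, c in zip(emb, centroid)]
        refined[tool] = normalize(mixed)
    return refined

def rank_tools(query_emb, tool_embs, k=5):
    """Serving step: plain cosine-similarity ranking. The refinement changes
    only the stored vectors, so this path gains no parameters or extra work."""
    q = normalize(query_emb)
    scores = {
        tool: sum(a * b for a, b in zip(q, normalize(emb)))
        for tool, emb in tool_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example with made-up 2-D embeddings.
tools = {"calc": [1.0, 0.0], "maps": [0.0, 1.0]}
history = [
    ([0.6, 0.8], "calc", True),
    ([0.5, 0.866], "calc", True),
    ([0.6, 0.8], "maps", False),  # failures are ignored in this sketch
]
refined = refine_tool_embeddings(tools, history, alpha=0.5)
before = rank_tools([0.6, 0.8], tools, k=1)    # top tool is "maps"
after = rank_tools([0.6, 0.8], refined, k=1)   # top tool becomes "calc"
```

In this toy setup the serving-time function is identical before and after refinement; only the precomputed tool vectors differ, which is consistent with the abstract's claim that the method adds no serving-time cost.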
