2605.07096v1 May 08, 2026 cs.LG

Query-efficient model evaluation using cached responses

Hayden S. Helm

Ben Johnson

Carey E. Priebe

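Evaluating a new model on an existing benchmark is often necessary before deployment. In modern evaluation frameworks, generating and scoring a response for every query can be expensive. In practice, responses from previously evaluated models are often cached, which creates an opportunity to use this additional information to reduce the number of queries needed to evaluate a new model accurately. This paper introduces a method for predicting benchmark performance from cached model responses, built on the Data Kernel Perspective Space (DKPS), a method for quantifying relationships between models in the black-box setting. Theoretically, the paper shows that DKPS-based methods are query-efficient under certain conditions. Empirically, it shows that DKPS-based methods reach the same mean absolute error as baselines with a much smaller query budget. Finally, it proposes an offline method for selecting a query set that maximizes goodness-of-fit on the reference models, which improves prediction accuracy.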

Original Abstract

Evaluating a new model on an existing benchmark is often necessary to understand its behavior before deployment. For modern evaluation frameworks, generating and evaluating a response for all queries can be prohibitively expensive. In practice, responses from previously-evaluated models are often cached -- creating a potential opportunity to use this additional information to decrease the number of queries required to accurately evaluate a new model. In this paper, we introduce an approach for predicting benchmark performance that leverages cached model responses based on the Data Kernel Perspective Space (DKPS), a method for quantifying the relationship between models in the black-box setting. Theoretically, we show that DKPS-based methods are query-efficient under certain conditions. Empirically, we demonstrate that DKPS-based methods achieve the same mean absolute error as baselines with a substantially decreased query budget. We conclude by proposing an offline method for selecting a set of queries that maximizes the goodness-of-fit on reference models, improving prediction accuracy over random query selection.
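The abstract does not spell out the pipeline, so the following is only a minimal sketch of how a DKPS-style predictor might look, under stated assumptions. It assumes that each response has already been mapped to a fixed-length embedding by some external embedding model, that models are compared by the Frobenius distance between their stacked response embeddings on a shared query subset, that classical multidimensional scaling gives the low-dimensional model coordinates, and that a linear regression maps those coordinates to benchmark scores. The names `classical_mds` and `predict_score`, and the choice of linear regression, are illustrative and not taken from the paper.

```python
import numpy as np


def classical_mds(dist, k):
    """Embed n points in k dimensions from an (n, n) distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get a Gram matrix.
    gram = -0.5 * centering @ (dist ** 2) @ centering
    vals, vecs = np.linalg.eigh(gram)
    top = np.argsort(vals)[::-1][:k]
    # Clip tiny negative eigenvalues caused by floating-point error.
    scale = np.sqrt(np.clip(vals[top], 0.0, None))
    return vecs[:, top] * scale


def predict_score(ref_emb, ref_scores, new_emb, k=2):
    """Predict a new model's benchmark score from cached reference responses.

    ref_emb:    (n_ref, m, d) cached response embeddings of reference models
                on m selected queries (assumed precomputed).
    ref_scores: (n_ref,) known full-benchmark scores of the reference models.
    new_emb:    (m, d) response embeddings of the new model on the same queries.
    """
    all_emb = np.concatenate([ref_emb, new_emb[None]], axis=0)
    flat = all_emb.reshape(all_emb.shape[0], -1)
    # Pairwise Frobenius distances between models.
    dist = np.linalg.norm(flat[:, None, :] - flat[None, :, :], axis=-1)
    coords = classical_mds(dist, k)
    # Fit score ~ intercept + coordinates on the reference models only.
    design = np.column_stack([np.ones(len(ref_scores)), coords[:-1]])
    beta, *_ = np.linalg.lstsq(design, ref_scores, rcond=None)
    return float(np.concatenate([[1.0], coords[-1]]) @ beta)
```

In this sketch the new model is queried only on the `m` selected queries, while the reference models contribute both their cached responses and their known full-benchmark scores; the query budget is therefore `m` rather than the benchmark size.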

