2602.12143v1 Feb 12, 2026 cs.AI

STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

Junying Wang

Citations: 99

h-index: 4

Chunyi Li

Citations: 2,496

h-index: 24

Xiaohong Liu

Citations: 1,835

h-index: 26

Guangtao Zhai

Citations: 83

h-index: 5

Xiaoxia Wang

Citations: 23

h-index: 2

Chunxiao Li

Citations: 34

h-index: 3

Yijin Guo

Citations: 126

h-index: 6

Zijian Chen

Citations: 445

h-index: 12

Zicheng Zhang

Citations: 4,704

h-index: 34

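As comprehensive evaluation of large models becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and a lack of explanatory power, while purely LLM-based methods remain unreliable. We therefore propose STAR, a framework that connects data-driven statistical expectations with knowledge-driven agentic reasoning. STAR uses specialized retrievers to collect external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to produce statistical expectations with uncertainty. A reasoning module grounded in Expectation Violation Theory (EVT) then refines the predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, yielding adjustments with traceable explanations. Extensive experiments confirm that STAR consistently outperforms all baselines on both score-based and rank-based metrics. In particular, even under extreme sparsity, with only 1 to 2 observed scores per test model, it achieves a 14.46% improvement in total score over the strongest statistical method.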

Original Abstract

As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1–2 observed scores per test model.
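To make the statistical half of the abstract concrete, the following is a minimal sketch of plain probabilistic matrix factorization on a sparse model-by-benchmark score matrix. It is not the paper's CPMF: the constraints and semantic features are omitted, and the uncertainty here is only the spread across random restarts. The names `pmf_fit` and `pmf_ensemble`, the toy scores, and all hyperparameters are assumptions chosen for illustration.

```python
import numpy as np

def pmf_fit(scores, mask, rank=2, lam=0.01, lr=0.05, epochs=3000, seed=0):
    """Fit scores ~ U @ V.T by gradient descent on observed cells only."""
    rng = np.random.default_rng(seed)
    n_models, n_benchmarks = scores.shape
    U = 0.1 * rng.standard_normal((n_models, rank))      # model factors
    V = 0.1 * rng.standard_normal((n_benchmarks, rank))  # benchmark factors
    for _ in range(epochs):
        err = mask * (scores - U @ V.T)  # residual, zero where unobserved
        grad_U = -err @ V + lam * U      # L2 penalty = Gaussian prior on U
        grad_V = -err.T @ U + lam * V    # L2 penalty = Gaussian prior on V
        U -= lr * grad_U
        V -= lr * grad_V
    return U, V

def pmf_ensemble(scores, mask, seeds=(0, 1, 2, 3, 4), **kwargs):
    """Mean prediction and a crude uncertainty (std over random restarts)."""
    preds = []
    for s in seeds:
        U, V = pmf_fit(scores, mask, seed=s, **kwargs)
        preds.append(U @ V.T)
    preds = np.stack(preds)
    return preds.mean(axis=0), preds.std(axis=0)

# Toy data (hypothetical): 4 models x 3 benchmarks, scores in [0, 1].
true_scores = np.outer([0.9, 0.7, 0.5, 0.3], [1.0, 0.8, 0.6])
mask = np.ones_like(true_scores)
mask[3, 1] = mask[3, 2] = 0.0  # the last model has one observed score
mask[2, 2] = 0.0
scores = true_scores * mask    # unobserved cells are ignored via the mask

mean_pred, std_pred = pmf_ensemble(scores, mask)
print(np.round(mean_pred, 3))  # statistical expectation for every cell
print(np.round(std_pred, 3))   # restart spread as an uncertainty proxy
```

In the framework the abstract describes, expectations of this kind would then be passed to the EVT-guided reasoning module for adjustment; that step is not sketched here.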

2 Citations

0 Influential

17 Altmetric

87.0 Score
