2601.17399v1 Jan 24, 2026 cs.CV

ReLE: A Scalable System and Structured Benchmark for Diagnosing Capability Anisotropy in Chinese LLMs

Bin Hu

Citations: 184

h-index: 5

Wei Chen

Citations: 2,110

h-index: 4

R. Fang

Citations: 0

h-index: 0

Jian Li

Citations: 41

h-index: 3

Ying-Cong Chen

Citations: 184

h-index: 5

Xin Tang

Citations: 28

h-index: 3

Liang Diao

Citations: 133

h-index: 2

Original Abstract

Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains challenged by benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) A Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; (2) A Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70\% compared to full-pass evaluations while maintaining a ranking correlation of $\rho=0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
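The abstract names Neyman allocation as the basis of the scheduler but does not spell it out. As a rough illustration only, textbook Neyman allocation splits a fixed sample budget across strata in proportion to stratum size times stratum standard deviation, so high-variance cells of the Domain $\times$ Capability matrix receive more samples. The sketch below assumes that plain textbook form; the function name `neyman_allocation`, its parameters, and the rounding strategy are hypothetical, and the paper's noise correction and dynamic scheduling are not reproduced here.

```python
def neyman_allocation(sizes, stds, budget):
    """Split `budget` samples across strata, proportional to size * std.

    sizes  -- number of available samples in each stratum (hypothetical input)
    stds   -- estimated score standard deviation in each stratum
    budget -- total number of samples to evaluate
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    capacity = sum(sizes)
    if capacity == 0:
        return [0] * len(sizes)
    budget = min(budget, capacity)  # cannot draw more than exists

    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total <= 0:
        # No variance information: fall back to proportional allocation.
        weights = [float(n) for n in sizes]
        total = sum(weights)

    # Ideal (fractional) allocation, then floor and cap at stratum size.
    raw = [budget * w / total for w in weights]
    alloc = [min(int(r), n) for r, n in zip(raw, sizes)]

    # Hand out the remainder, largest fractional part first,
    # skipping strata that are already exhausted.
    leftover = budget - sum(alloc)
    order = sorted(range(len(sizes)),
                   key=lambda i: raw[i] - int(raw[i]), reverse=True)
    while leftover > 0:
        progressed = False
        for i in order:
            if leftover == 0:
                break
            if alloc[i] < sizes[i]:
                alloc[i] += 1
                leftover -= 1
                progressed = True
        if not progressed:
            break
    return alloc


# Three strata; the first has twice the spread of the others.
print(neyman_allocation([1000, 1000, 2000], [0.5, 0.25, 0.25], 250))
# → [100, 50, 100]
```

In this toy call the first stratum gets as many samples as the third despite being half its size, because its standard deviation is twice as large. A stratum with zero estimated variance receives samples only when the others are exhausted.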
