2602.12424v1 Feb 12, 2026 cs.CL

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

Yue Huang (citations: 2,201; h-index: 18)
Ziqi Zhang (citations: 4; h-index: 1)
Xingjian Hu (citations: 3; h-index: 1)
Kai Zhang (citations: 17; h-index: 3)
Yixin Liu (citations: 194; h-index: 5)
Qingsong Wen (citations: 123; h-index: 4)
Xiangliang Zhang (citations: 870; h-index: 14)
Neil Zhenqiang Gong (citations: 1,308; h-index: 12)
Lichao Sun (citations: 412; h-index: 4)
Ruoxi Chen (citations: 1,106; h-index: 8)
Kaidi Xu (citations: 34; h-index: 2)

Abstract

Benchmarks establish a standardized evaluation framework to systematically assess the performance of large language models (LLMs), facilitating objective comparisons and driving advancements in the field. However, existing benchmarks fail to differentiate question difficulty, limiting their ability to effectively distinguish models' capabilities. To address this limitation, we propose RankLLM, a novel framework designed to quantify both question difficulty and model competency. RankLLM introduces difficulty as the primary criterion for differentiation, enabling a more fine-grained evaluation of LLM capabilities. RankLLM's core mechanism facilitates bidirectional score propagation between models and questions. The core intuition of RankLLM is that a model earns a competency score when it correctly answers a question, while a question's difficulty score increases when it challenges a model. Using this framework, we evaluate 30 models on 35,550 questions across multiple domains. RankLLM achieves 90% agreement with human judgments and consistently outperforms strong baselines such as IRT. It also exhibits strong stability, fast convergence, and high computational efficiency, making it a practical solution for large-scale, difficulty-aware LLM evaluation.
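The abstract describes bidirectional score propagation but does not give the update rules, so the following is only a minimal sketch of the stated intuition, not the authors' algorithm. It assumes a binary correctness matrix and a HITS/PageRank-style alternating update; the function names, the `damping` term, and the iteration count are hypothetical choices made for illustration.

```python
def _mix(raw, damping):
    """Normalize raw scores to sum to 1, then blend with a uniform prior.

    The uniform term (an assumption of this sketch) keeps every score
    positive so that no model or question drops out of the propagation.
    """
    n = len(raw)
    total = sum(raw)
    if total == 0:
        return [1.0 / n] * n
    return [damping * v / total + (1.0 - damping) / n for v in raw]


def propagate_scores(correct, iterations=50, damping=0.85):
    """Alternate difficulty and competency updates.

    correct[m][q] is True when model m answered question q correctly.
    Returns (competency per model, difficulty per question).
    """
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # A question gains difficulty from each model it defeats,
        # weighted by that model's current competency.
        difficulty = _mix(
            [sum(competency[m] for m in range(n_models) if not correct[m][q])
             for q in range(n_questions)],
            damping,
        )
        # A model gains competency from each question it answers correctly,
        # weighted by that question's current difficulty.
        competency = _mix(
            [sum(difficulty[q] for q in range(n_questions) if correct[m][q])
             for m in range(n_models)],
            damping,
        )
    return competency, difficulty


# Toy data: model 0 answers everything, model 2 only the first question.
toy = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
]
competency, difficulty = propagate_scores(toy)
```

On this toy matrix the model that answers every question ends up with the highest competency, and the question that only one model answers ends up with the highest difficulty. Unlike plain accuracy, the weighting means a correct answer on a hard question contributes more than one on an easy question.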

