2602.02497v1 Jan 14, 2026 cs.CL

STEMVerse: A Dual-Axis Diagnostic Framework for STEM Reasoning in Large Language Models

Xuchen Li (Citations: 11, h-index: 1)

Xuzhao Li (Citations: 78, h-index: 5)

Jian Zhao (Citations: 15, h-index: 1)

Shiyu Hu (Citations: 271, h-index: 9)

Abstract

As Large Language Models (LLMs) achieve significant breakthroughs in complex reasoning tasks, evaluating their proficiency in science, technology, engineering, and mathematics (STEM) has become a primary method for measuring machine intelligence. However, current evaluation paradigms often treat benchmarks as isolated "silos," offering only monolithic aggregate scores that neglect the intricacies of both academic specialization and cognitive depth. This result-oriented approach fails to distinguish whether model errors stem from insufficient domain knowledge or deficiencies in cognitive capacity, thereby limiting the diagnostic value. To address this, we propose STEMVerse, a diagnostic framework designed to systematically analyze the STEM reasoning capabilities of LLMs. This framework characterizes model performance across academic specialization and cognitive complexity to map the capability required for reasoning. We re-aggregate over 20,000 STEM problems from mainstream benchmarks into a unified "Discipline $\times$ Cognition" capability space, assigning dual-axis labels to every instance. Utilizing this unified diagnostic framework, we systematically evaluate representative LLM families across varying parameter scales and training paradigms. Our empirical results reveal structural failure patterns in STEM reasoning. By integrating multi-disciplinary coverage and fine-grained cognitive stratification into a unified framework, STEMVerse provides a clear and actionable perspective for understanding the scientific reasoning characteristics of LLMs.
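To make the dual-axis idea concrete, the following is a minimal sketch of how per-instance results might be aggregated over a "Discipline $\times$ Cognition" grid. It is not the authors' implementation: the record layout, the function names `cell_accuracy` and `marginal_accuracy`, and the label values ("Physics", "Recall", "Analysis", and so on) are all assumptions made for illustration, since the abstract does not list the actual label sets.

```python
from collections import defaultdict

# Hypothetical per-instance results: (discipline, cognition_level, is_correct).
# The label names are illustrative assumptions, not the paper's taxonomy.
records = [
    ("Physics", "Recall", True),
    ("Physics", "Recall", True),
    ("Physics", "Analysis", False),
    ("Mathematics", "Recall", True),
    ("Mathematics", "Analysis", True),
    ("Mathematics", "Analysis", False),
]


def cell_accuracy(records):
    """Return accuracy for every (discipline, cognition) cell that has data."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for discipline, cognition, ok in records:
        key = (discipline, cognition)
        totals[key] += 1
        correct[key] += int(ok)
    return {key: correct[key] / totals[key] for key in totals}


def marginal_accuracy(records, axis):
    """Return accuracy along one axis: 0 for discipline, 1 for cognition."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        label = record[axis]
        totals[label] += 1
        correct[label] += int(record[2])
    return {label: correct[label] / totals[label] for label in totals}


grid = cell_accuracy(records)
by_discipline = marginal_accuracy(records, 0)
by_cognition = marginal_accuracy(records, 1)

for (discipline, cognition), acc in sorted(grid.items()):
    print(f"{discipline:12s} {cognition:10s} {acc:.2f}")
```

In this toy data both disciplines have the same marginal accuracy (2 of 3 correct), while the cognition axis separates sharply (3 of 3 on "Recall" versus 1 of 3 on "Analysis"). That is the kind of gap a single aggregate score hides and a two-axis breakdown exposes.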

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
