2605.06444v1 May 07, 2026 cs.AI

SCRuB: Social Concept Reasoning under Rubric-Based Evaluation

Mahesh Pasupuleti

Citations: 16,412

h-index: 9

Skyler Wang

Citations: 2,967

h-index: 12

Arjun Subramonian

Meta FAIR

Citations: 4,170

h-index: 15

Anaelia Ovalle

Citations: 641

h-index: 10

Jamelle Watson-Daniels

Citations: 241

h-index: 9

Himaghna Bhattacharjee

Citations: 130

h-index: 5

Brandon Handoko

Citations: 1

h-index: 1

Candace Ross

Citations: 647

h-index: 13

Vidya Sarma

Citations: 34

h-index: 2

Karen Ullrich

Citations: 79

h-index: 5

Will van der Vaart

Citations: 1

h-index: 1

Yijing Xin

Citations: 6

h-index: 1

Maximilian Nickel

Citations: 16,139

h-index: 32

Antonio Li

Citations: 1

h-index: 1

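While much research on the reasoning capabilities of large language models (LLMs) focuses on mathematical or technical tasks, research on reasoning about social concepts, that is, the abstract ideas that shape social norms, culture, and institutions, remains scarce. This important but overlooked capability is essential for modern models operating as social agents, yet no systematic evaluation methodology targets it. In this work we introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for settings of task indeterminacy. The goal of SCRuB is to measure how well a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases. First, prompts are constructed from established sources. Second, experts and models generate responses. Third, the responses are compared using a five-dimensional critical thinking rubric. To make the pipeline more generalizable, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). The results show that frontier models consistently outperform human experts on all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study offers the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.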

Original Abstract

While many studies of Large Language Model (LLM) reasoning capabilities emphasize mathematical or technical tasks, few address reasoning about social concepts: the abstract ideas shaping social norms, culture, and institutions. This understudied capability is essential for modern models acting as social agents, yet no systematic evaluation methodology targets it. We introduce SCRuB (Social Concept Reasoning under Rubric-Based Evaluation), a framework designed for this setting of task indeterminacy. Our goal is to measure the degree to which a model reasons about social concepts with the depth and critical rigor of a human expert. SCRuB proceeds in three phases: prompt construction from established sources, response generation by experts and models, and comparative evaluation using a five-dimensional critical thinking rubric. To enable generalization of the pipeline, we introduce a Panel of Disciplinary Perspectives ensemble validated against independent expert judges. We release SCRuBEval (n=4,711 evaluation prompts) and SCRuBAnnotations (300 expert-authored responses and 150 expert comparative judgments from 45 PhD-level scholars). Our results show that frontier models consistently outperform human experts across all five rubric dimensions. Across 1,170 pairwise comparisons, expert judges ranked a model response first in 80.8% of judgments and preferred model responses overall 74.4% of the time. Ultimately, this study provides the first expert-grounded demonstration of evaluation saturation for social concept reasoning: the single-turn exam-style format has reached its ceiling for models and humans alike.
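The two headline figures in the abstract (how often a model response is ranked first, and how often model responses are preferred overall) can be illustrated with a small sketch. This is only one plausible reading of those metrics, not the authors' aggregation procedure: the `Judgment` schema, the function names, and the toy data are all hypothetical and do not reflect the SCRuBAnnotations format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Judgment:
    # Hypothetical schema: sources of the responses in one judge's ranking,
    # ordered best to worst. Each entry is "model" or "expert".
    ranking: tuple[str, ...]


def top_rank_rate(judgments: list[Judgment]) -> float:
    """Fraction of judgments whose top-ranked response came from a model."""
    return sum(j.ranking[0] == "model" for j in judgments) / len(judgments)


def pairwise_preference_rate(judgments: list[Judgment]) -> float:
    """Fraction of model-vs-expert pairs in which the model is ranked higher."""
    wins = 0
    total = 0
    for j in judgments:
        for i, higher in enumerate(j.ranking):
            for lower in j.ranking[i + 1:]:
                if higher != lower:  # skip model-model and expert-expert pairs
                    total += 1
                    wins += higher == "model"
    return wins / total


# Toy data for illustration only (assumed, not taken from the paper).
toy = [
    Judgment(("model", "model", "expert")),
    Judgment(("model", "expert", "model")),
    Judgment(("expert", "model", "model")),
]

print(f"model ranked first: {top_rank_rate(toy):.1%}")
print(f"model preferred overall: {pairwise_preference_rate(toy):.1%}")
```

The two rates can diverge, as they do on the toy data: the first looks only at the top position of each ranking, while the second counts every model-versus-expert pair within a ranking.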

1 Citations

0 Influential

16 Altmetric

81.0 Score
