2601.08430v1 Jan 13, 2026 cs.AI

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

Sunzhu Li (Citations: 30, h-index: 2)
Jiale Zhao (Citations: 33, h-index: 2)
Miteto Wei (Citations: 6, h-index: 1)
Huimin Ren (Citations: 6, h-index: 1)
Yang Zhou (Citations: 30, h-index: 2)
Jingwen Yang (Citations: 29, h-index: 2)
Kaike Zhang (Citations: 6, h-index: 1)
Wei Chen (Citations: 7, h-index: 1)
Shunyu Liu (Citations: 555, h-index: 9)

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By combining principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing subtle nuances. Based on this framework, we introduce RubricHub, a large-scale (approximately 110k), multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.
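The abstract does not spell out how rubric scores feed the RuFT stage. As a rough illustration only, a rubric can be treated as a weighted checklist and rejection sampling as a threshold filter on the resulting score. Everything in this sketch (the `Criterion` class, `rubric_score`, `rejection_sample`, the keyword-based checkers, and the 0.75 threshold) is a hypothetical stand-in, not the authors' implementation; in practice each checker would more plausibly be an LLM grader.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not from the paper)."""
    description: str
    weight: float
    check: Callable[[str], bool]  # stand-in for an LLM or rule-based grader


def rubric_score(response: str, rubric: Sequence[Criterion]) -> float:
    """Return the weighted fraction of criteria the response satisfies, in [0, 1]."""
    total = sum(c.weight for c in rubric)
    if total <= 0:
        raise ValueError("rubric must have positive total weight")
    earned = sum(c.weight for c in rubric if c.check(response))
    return earned / total


def rejection_sample(
    candidates: Sequence[str], rubric: Sequence[Criterion], threshold: float
) -> List[str]:
    """Keep only candidates whose rubric score reaches the (assumed) threshold."""
    return [r for r in candidates if rubric_score(r, rubric) >= threshold]


# Toy rubric with keyword checkers; real criteria would be far more fine-grained.
rubric = [
    Criterion("mentions seeing a doctor", 2.0, lambda r: "doctor" in r.lower()),
    Criterion("gives a dosage caveat", 1.0, lambda r: "dosage" in r.lower()),
    Criterion("stays under 30 words", 1.0, lambda r: len(r.split()) < 30),
]
candidates = [
    "Take some rest.",
    "See a doctor and check the dosage on the label.",
]
kept = rejection_sample(candidates, rubric, threshold=0.75)
```

In this toy run only the second candidate survives the filter, and the surviving responses would then serve as fine-tuning targets. The same `rubric_score` value could in principle be used as a scalar reward in an RL stage, though the abstract does not say how RuRL computes its reward.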

Citations: 6 | Influential: 0 | Altmetric: 4.5 | Score: 28.5
