2601.22595v1 Jan 30, 2026 cs.AI

Learn More with Less: Uncertainty Consistency Guided Query Selection for RLVR

Yulan Hu

Xin Li

Ouyang Sheng

Lizhong Ding

Yong Liu

Hao Yi

Abstract

Large Language Models (LLMs) have recently improved at mathematical reasoning through Reinforcement Learning with Verifiable Reward (RLVR). However, existing RLVR algorithms require large query budgets, which makes annotation costly. We introduce active learning (AL) into RLVR and investigate whether fewer but more informative queries can yield similar or superior performance. We find that classic AL sampling strategies fail to outperform random selection in this setting, because selecting by subjective uncertainty alone ignores objective uncertainty. We therefore propose an uncertainty consistency metric that evaluates how well subjective uncertainty aligns with objective uncertainty. In the offline setting, this alignment is measured with the Point-Biserial Correlation Coefficient (PBC). In online training, limited sampling and a dynamically shifting output distribution make PBC difficult to estimate, so we introduce a new online variant computed from the normalized advantage and subjective uncertainty. Theoretically, we prove that the online variant is strictly negatively correlated with the offline PBC and supports better sample selection. Experiments show that our method consistently outperforms random and classic AL baselines and matches full-dataset performance while training on only 30% of the data, effectively reducing the cost of RLVR for reasoning tasks.
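As a rough illustration of the offline measurement, the point-biserial correlation between a continuous score and a binary outcome can be computed as below. This is a minimal sketch of the textbook coefficient, not the authors' implementation. The function name `point_biserial` and the inputs `uncertainty` and `correct` are hypothetical. Treating a verifier's pass/fail label as the binary variable is an assumption, since the abstract does not say how subjective or objective uncertainty is defined.

```python
import math


def point_biserial(uncertainty, correct):
    """Textbook point-biserial correlation between a continuous score and a 0/1 label.

    `uncertainty`: per-response subjective uncertainty scores (assumed input).
    `correct`: 1 if the verifier accepted the response, else 0 (assumed input).
    """
    if len(uncertainty) != len(correct):
        raise ValueError("inputs must have the same length")
    n = len(uncertainty)
    group1 = [u for u, c in zip(uncertainty, correct) if c == 1]
    group0 = [u for u, c in zip(uncertainty, correct) if c == 0]
    n1, n0 = len(group1), len(group0)
    if n1 == 0 or n0 == 0:
        raise ValueError("both outcome groups must be non-empty")

    mean_all = sum(uncertainty) / n
    # Population standard deviation (divide by n) of the continuous variable.
    std = math.sqrt(sum((u - mean_all) ** 2 for u in uncertainty) / n)
    if std == 0:
        raise ValueError("uncertainty values are constant")

    mean1 = sum(group1) / n1
    mean0 = sum(group0) / n0
    return (mean1 - mean0) / std * math.sqrt(n1 * n0 / n**2)


# Toy data: correct responses carry low uncertainty, incorrect ones high.
r = point_biserial([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0])
print(f"point-biserial r = {r:.3f}")
```

On this toy data the coefficient is strongly negative, meaning higher uncertainty goes with incorrect answers. The online variant built from the normalized advantage is not specified in the abstract, so it is not sketched here.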
