2603.08095v1 Mar 09, 2026 cs.CL

DC-W2S: Dual-Consensus Weak-to-Strong Training for Reliable Process Reward Modeling in Biological Reasoning

Chi-Min Chan

Citations: 256

h-index: 7

Ehsan Hajiramezanali

Citations: 566

h-index: 9

Xiner Li

Citations: 1,094

h-index: 13

E. Brouwer

Citations: 851

h-index: 11

C. Edwards

Citations: 1

h-index: 1

Wei Xue

Citations: 55

h-index: 5

Sirui Han

Citations: 43

h-index: 3

Yike Guo

Citations: 25

h-index: 3

Gabriele Scalia

Citations: 90

h-index: 5

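In scientific reasoning tasks, the correctness of the reasoning process matters as much as the final result. Process Reward Models (PRMs) can address the coarse-grained supervision problem that limits Outcome Reward Models (ORMs), but their use is constrained by the high cost of obtaining expert-verified step-level labels. This paper addresses the problem of training reliable PRMs from abundant but noisy "weak" supervision. The authors argue that existing Weak-to-Strong Generalization (W2SG) theory lacks concrete guidance for selecting high-quality training signals from noisy data. To close this gap, the paper proposes the Dual-Consensus Weak-to-Strong (DC-W2S) framework. DC-W2S combines a Self-Consensus (SC) metric among weak supervisors with a Neighborhood-Consensus (NC) metric in the embedding space to sort supervision signals by reliability level. It then guides training with instance-level balanced sampling and label-level reliability-aware masking. Experiments show that DC-W2S can train robust PRMs for complex reasoning without extensive expert annotation, demonstrating that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.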

Original Abstract

In scientific reasoning tasks, the veracity of the reasoning process is as critical as the final outcome. While Process Reward Models (PRMs) offer a solution to the coarse-grained supervision problems inherent in Outcome Reward Models (ORMs), their deployment is hindered by the prohibitive cost of obtaining expert-verified step-wise labels. This paper addresses the challenge of training reliable PRMs using abundant but noisy "weak" supervision. We argue that existing Weak-to-Strong Generalization (W2SG) theories lack prescriptive guidelines for selecting high-quality training signals from noisy data. To bridge this gap, we introduce the Dual-Consensus Weak-to-Strong (DC-W2S) framework. By intersecting Self-Consensus (SC) metrics among weak supervisors with Neighborhood-Consensus (NC) metrics in the embedding space, we stratify supervision signals into distinct reliability regimes. We then employ a curriculum of instance-level balanced sampling and label-level reliability-aware masking to guide the training process. We demonstrate that DC-W2S enables the training of robust PRMs for complex reasoning without exhaustive expert annotation, proving that strategic data curation is more effective than indiscriminate training on large-scale noisy datasets.
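The abstract does not give formulas for the two consensus metrics, so the following is only a minimal sketch of how intersecting them could stratify step labels. Everything here is an assumption made for illustration: binary step labels, SC as the fraction of weak supervisors agreeing with the majority vote, NC as the fraction of a step's k nearest neighbours (by cosine similarity) that share its majority label, the thresholds, the regime numbering, and all function names. It is not the authors' implementation, and the balanced-sampling curriculum is not shown.

```python
import numpy as np

def self_consensus(votes):
    """votes: (n_steps, n_supervisors) array of 0/1 labels (assumed format).
    Returns the majority label and the fraction of supervisors agreeing with it."""
    votes = np.asarray(votes)
    pos_frac = votes.mean(axis=1)
    majority = (pos_frac >= 0.5).astype(int)
    sc = np.where(majority == 1, pos_frac, 1.0 - pos_frac)
    return majority, sc

def neighborhood_consensus(embeddings, labels, k=2):
    """Fraction of each step's k nearest neighbours (cosine similarity)
    whose label matches the step's own label (assumed definition)."""
    labels = np.asarray(labels)
    x = np.asarray(embeddings, dtype=float)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)          # a step is not its own neighbour
    nn = np.argsort(-sim, axis=1)[:, :k]    # indices of the k most similar steps
    return (labels[nn] == labels[:, None]).mean(axis=1)

def stratify(sc, nc, sc_thresh=0.75, nc_thresh=0.6):
    """Hypothetical regime codes: 0 = both low, 1 = only SC high,
    2 = only NC high, 3 = both high."""
    return (sc >= sc_thresh).astype(int) + 2 * (nc >= nc_thresh).astype(int)

# Toy data: 6 reasoning steps, 5 weak supervisors, 2-D step embeddings.
votes = [[1, 1, 1, 1, 1],
         [1, 1, 1, 1, 0],
         [1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0],
         [0, 0, 0, 0, 1],
         [1, 1, 1, 1, 1]]
embeddings = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.2],
              [0.0, 1.0], [0.1, 1.0], [0.2, 1.0]]

majority, sc = self_consensus(votes)
nc = neighborhood_consensus(embeddings, majority, k=2)
regime = stratify(sc, nc)

# Label-level masking (assumed rule): only steps where both signals are high
# contribute their label to the training loss.
loss_mask = regime == 3
print(regime.tolist(), loss_mask.tolist())
```

In this toy setup the last step is unanimously labeled positive by the supervisors but sits among negatively labeled neighbours, so it lands in the "only SC high" regime and is masked out. That is the kind of disagreement a second, independent consensus signal is meant to surface.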

0 Citations

0 Influential

6.5 Altmetric

32.5 Score
