2604.03993v1 Apr 05, 2026 cs.LG

Can LLMs Learn to Reason Robustly under Noisy Supervision?

Sharon Li

Citations: 239

h-index: 5

Gang Chen

Citations: 531

h-index: 10

Haobo Wang

Citations: 602

h-index: 7

Shenzhi Yang

Citations: 102

h-index: 3

Guangcheng Zhu

Citations: 80

h-index: 2

Bowen Song

Citations: 31

h-index: 3

Xing Zheng

Citations: 54

h-index: 1

Yingfan Ma

Citations: 11

h-index: 1

Zhongqi Chen

Citations: 17

h-index: 2

Weiqiang Wang

Citations: 22

h-index: 3

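Reinforcement Learning with Verifiable Rewards (RLVR) trains reasoning models effectively but relies on abundant, perfect labels, and its vulnerability to the noisy labels that expert scarcity makes unavoidable remains seriously overlooked. In this work, we take a first step toward a systematic analysis of noisy-label mechanisms in RLVR. Unlike supervised classification, most RLVR algorithms include a rollout-based condition: a label's influence on training depends on whether the current policy can generate rollouts that realize that label, and this property carries over naturally to noisy labels. Building on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. In experiments on training with noisy samples, we identify an "Early Correctness Coherence" phenomenon: noisy samples begin to lag behind in later stages of training, but in early training, accuracy on clean and noisy samples rises at a similar rate. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively replaces potentially noisy labels with answers obtained by majority voting. OLR applies the correction only when two conditions hold: first, the rollout pass rate of the majority answer has a positive slope, and second, the majority answer has remained historically consistent across updates. This enables gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, with average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.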
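The distinction between inactive and active noisy labels can be illustrated with a small sketch. It assumes a binary exact-match reward and a group-normalized advantage in the style of GRPO; the abstract names neither, and the function names `group_advantages` and `noise_type` are hypothetical. Under these assumptions, a noisy label that no rollout reproduces yields all-zero advantages (no learning signal, so the sample is wasted), while a noisy label that some rollout reproduces gives those wrong rollouts a positive advantage.

```python
from statistics import mean, pstdev


def group_advantages(rollout_answers, label):
    """Group-normalized advantages under a binary exact-match reward.

    Assumption: reward is 1.0 when a rollout's final answer equals the
    label and 0.0 otherwise; advantages are (r - mean) / std per group.
    """
    rewards = [1.0 if answer == label else 0.0 for answer in rollout_answers]
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0.0:
        # All rewards identical: the group carries no learning signal.
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


def noise_type(rollout_answers, noisy_label):
    """Label a noisy sample 'active' if any rollout realizes the noisy label."""
    if any(answer == noisy_label for answer in rollout_answers):
        return "active"
    return "inactive"


# Hypothetical rollouts for one prompt whose true answer is "42".
rollouts = ["42", "42", "17", "42"]

# Noisy label "99": no rollout matches, so every advantage is zero.
inactive_adv = group_advantages(rollouts, "99")

# Noisy label "17": the one wrong rollout is rewarded and reinforced.
active_adv = group_advantages(rollouts, "17")
```

In this toy group, the noisy label "17" gives the third rollout a positive advantage and the three correct rollouts a negative one, which is the skew toward an incorrect distribution that the abstract attributes to active noise.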

Original Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) effectively trains reasoning models that rely on abundant perfect labels, but its vulnerability to unavoidable noisy labels due to expert scarcity remains critically underexplored. In this work, we take the first step toward a systematic analysis of noisy label mechanisms in RLVR. In contrast to supervised classification, most RLVR algorithms incorporate a rollout-based condition: a label's influence on training is contingent on whether the current policy can generate rollouts that realize it, a property that naturally extends to noisy labels. Based on this observation, we distinguish two types of noise: inactive noisy labels, which reduce data efficiency, and active noisy labels, which are reinforced and risk skewing the model toward incorrect distributions. From experiments on training with noisy samples, we identify an Early Correctness Coherence phenomenon: although noisy samples begin to lag behind in later stages, accuracy on both clean and noisy samples increases similarly in early training. Motivated by this dynamic, we propose Online Label Refinement (OLR), which progressively corrects potentially noisy labels with majority-voted answers when two conditions hold: a positive slope in the majority answer's rollout pass rate and stable historical consistency across updates, enabling gradual self-correction as the policy improves. We evaluate OLR on six in-distribution mathematical reasoning benchmarks (AIME24/25, AMC, MATH-500, Minerva, and Olympiad) and three out-of-distribution tasks (ARC-c, GPQA-diamond, and MMLU-pro). Across noise ratios from 0.1 to 0.9, OLR consistently improves robustness under both inactive and active noisy-label settings, achieving average gains of 3.6% to 3.9% on in-distribution benchmarks and 3.3% to 4.6% on out-of-distribution evaluations.
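The two OLR conditions described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not specify how the slope is measured, how long a history is required, or how consistency is defined, so the least-squares slope, the `window` parameter, and the "same majority answer in every recent step" test are all assumptions, as are the function names.

```python
from collections import Counter


def majority_answer(rollout_answers):
    """Return (answer, pass_rate) for the most frequent rollout answer."""
    answer, count = Counter(rollout_answers).most_common(1)[0]
    return answer, count / len(rollout_answers)


def slope(values):
    """Least-squares slope of values against step indices 0..n-1 (n >= 2)."""
    n = len(values)
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def refine_label(label, history, window=3):
    """Replace `label` with the majority answer when both conditions hold.

    `history` is a list of (majority_answer, pass_rate) pairs, oldest first.
    Assumed conditions over the last `window` steps (window >= 2):
      1. the majority answer's pass rate has a positive slope;
      2. the majority answer is the same at every step.
    """
    if len(history) < window:
        return label  # not enough evidence yet; keep the given label
    recent = history[-window:]
    answers = [answer for answer, _ in recent]
    rates = [rate for _, rate in recent]
    consistent = len(set(answers)) == 1
    rising = slope(rates) > 0
    return answers[-1] if (consistent and rising) else label


# Hypothetical rollouts for one prompt across three policy updates.
steps = [
    ["7", "7", "3", "5"],
    ["7", "7", "7", "3"],
    ["7", "7", "7", "7"],
]
history = [majority_answer(step) for step in steps]
refined = refine_label("3", history)
```

Here the given label "3" is replaced by "7" because the majority answer is stable and its pass rate rises from 0.5 to 1.0. A flat or falling pass rate, or a majority answer that changes between steps, leaves the original label untouched, which matches the gradual, policy-dependent correction described in the abstract.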

1 Citation

0 Influential

5 Altmetric

26.0 Score
