2604.18574v1 Apr 20, 2026 cs.LG


When Can LLMs Learn to Reason with Weak Supervision?

Pavel Izmailov

Citations: 9

h-index: 2

Saadia Gabriel

Citations: 3,698

h-index: 13

Jingyan Shen

Citations: 49

h-index: 3

Anna Mordvina

Citations: 0

h-index: 0

Hamid Palangi

Citations: 6,070

h-index: 16

Salman Rahman

Citations: 337

h-index: 3

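Large language models have made substantial gains in reasoning ability through reinforcement learning with verifiable rewards (RLVR). As models become more capable, however, high-quality reward signals are increasingly hard to construct, so it is important to understand the conditions under which reinforcement learning can succeed under weak forms of supervision. We run systematic experiments across a range of model families and reasoning domains in three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. The results show that generalization is determined by the saturation dynamics of the training reward. Models that generalize go through a prolonged pre-saturation phase in which training reward and downstream performance rise together, whereas models that saturate quickly only memorize and do not learn. We define "reasoning faithfulness" as the degree to which intermediate steps logically support the final answer, and find that it is the pre-RL property that predicts which regime a model falls into. Output diversity alone does not indicate whether a model will generalize. Building on these findings, we analyze the contributions of continual pre-training and supervised fine-tuning: supervised fine-tuning on data with explicit reasoning traces is necessary for generalization under weak supervision, and continual pre-training on in-domain data amplifies this effect. Applying both to Llama3.2-3B-Base enables generalization in all three weak supervision settings where the base model previously failed.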

Original Abstract

Large language models have achieved significant reasoning improvements through reinforcement learning with verifiable rewards (RLVR). Yet as model capabilities grow, constructing high-quality reward signals becomes increasingly difficult, making it essential to understand when RLVR can succeed under weaker forms of supervision. We conduct a systematic empirical study across diverse model families and reasoning domains under three weak supervision settings: scarce data, noisy rewards, and self-supervised proxy rewards. We find that generalization is governed by training reward saturation dynamics: models that generalize exhibit a prolonged pre-saturation phase during which training reward and downstream performance climb together, while models that saturate rapidly memorize rather than learn. We identify reasoning faithfulness, defined as the extent to which intermediate steps logically support the final answer, as the pre-RL property that predicts which regime a model falls into, while output diversity alone is uninformative. Motivated by these findings, we disentangle the contributions of continual pre-training and supervised fine-tuning, finding that SFT on explicit reasoning traces is necessary for generalization under weak supervision, while continual pre-training on domain data amplifies the effect. Applied together to Llama3.2-3B-Base, these interventions enable generalization across all three settings where the base model previously failed.
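To make the three weak supervision settings more concrete, the following is a minimal Python sketch, not the paper's implementation. Everything in it is an illustrative assumption: the exact-match verifier, the symmetric flip model for noisy rewards, majority voting as the self-supervised proxy reward, and the 0.95 threshold used to mark reward saturation. All function names are hypothetical.

```python
import random
from collections import Counter


def verifiable_reward(answer: str, reference: str) -> float:
    """Exact-match reward: 1.0 if the final answer equals the reference."""
    return 1.0 if answer.strip() == reference.strip() else 0.0


def noisy_reward(answer: str, reference: str, flip_prob: float,
                 rng: random.Random) -> float:
    """Noisy-reward setting (assumed noise model): flip the binary
    reward with probability flip_prob."""
    reward = verifiable_reward(answer, reference)
    return 1.0 - reward if rng.random() < flip_prob else reward


def majority_vote_rewards(answers: list[str]) -> list[float]:
    """Self-supervised proxy (assumed): reward agreement with the most
    common sampled answer; no reference answer is used."""
    normalized = [a.strip() for a in answers]
    majority, _ = Counter(normalized).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalized]


def saturation_step(mean_rewards: list[float], threshold: float = 0.95):
    """First training step whose mean reward reaches the threshold,
    or None if the reward never saturates."""
    for step, reward in enumerate(mean_rewards):
        if reward >= threshold:
            return step
    return None
```

Under these assumptions, the scarce-data setting corresponds to calling `verifiable_reward` on only a small set of prompts, and a late `saturation_step` would correspond to the prolonged pre-saturation phase that the abstract associates with generalization.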

