2604.07941v1 Apr 09, 2026 cs.CL

Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning

Hualong Yu, Zhihu Wang, Chenfei Liu, Jiaming Zhou, Caiyu Xu, Liting Zhang, Yuhang Jia, Yanzhe Zhang, Zicheng Xu, Qicheng Li, Yong Qin, Shiwan Zhao, Xu Zhao

Original Abstract

Post-training has become central to turning pretrained large language models (LLMs) into aligned and deployable systems. Recent progress spans supervised fine-tuning (SFT), preference optimization, reinforcement learning (RL), process supervision, verifier-guided methods, distillation, and multi-stage pipelines. Yet these methods are often discussed in fragmented ways, organized by labels or objective families rather than by the behavioral bottlenecks they address. This survey argues that LLM post-training is best understood as structured intervention on model behavior. We organize the field first by trajectory provenance, which defines two primary learning regimes: off-policy learning on externally supplied trajectories, and on-policy learning on learner-generated rollouts. We then interpret methods through two recurring roles -- effective support expansion, which makes useful behaviors more reachable, and policy reshaping, which improves behavior within already reachable regions -- together with a complementary systems-level role, behavioral consolidation, which preserves, transfers, and amortizes behavior across stages and model transitions. This perspective yields a unified reading of major paradigms. SFT may serve either support expansion or policy reshaping, whereas preference-based methods are usually off-policy reshaping. On-policy RL often improves behavior on learner-generated states, though under stronger guidance it can also make hard-to-reach reasoning paths reachable. Distillation is often best understood as consolidation rather than only compression, and hybrid pipelines emerge as coordinated multi-stage compositions. Overall, the framework helps diagnose post-training bottlenecks and reason about stage composition, suggesting that progress in LLM post-training increasingly depends on coordinated system design rather than any single dominant objective.
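The two-axis reading described in the abstract (trajectory provenance, then behavioral role) can be written down as a small lookup structure. The following is a minimal sketch for illustration only, not code from the survey: the class and function names are hypothetical, the role assignments are paraphrased from the abstract, and treating SFT as off-policy is an assumption, since the abstract does not state its provenance explicitly. Distillation is left without a provenance because the abstract does not assign one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, List, Optional


class Provenance(Enum):
    """Where training trajectories come from (the abstract's first axis)."""
    OFF_POLICY = "externally supplied trajectories"
    ON_POLICY = "learner-generated rollouts"


class Role(Enum):
    """Behavioral roles named in the abstract."""
    SUPPORT_EXPANSION = "makes useful behaviors more reachable"
    POLICY_RESHAPING = "improves behavior within already reachable regions"
    CONSOLIDATION = "preserves, transfers, and amortizes behavior"


@dataclass(frozen=True)
class Method:
    name: str
    provenance: Optional[Provenance]  # None when the abstract does not say
    roles: FrozenSet[Role]


# Hypothetical encoding of the abstract's reading of each paradigm.
TAXONOMY = [
    # Assumption: SFT on external demonstrations is treated as off-policy.
    Method("SFT", Provenance.OFF_POLICY,
           frozenset({Role.SUPPORT_EXPANSION, Role.POLICY_RESHAPING})),
    Method("preference optimization", Provenance.OFF_POLICY,
           frozenset({Role.POLICY_RESHAPING})),
    # Reshaping is the usual role; expansion applies "under stronger guidance".
    Method("on-policy RL", Provenance.ON_POLICY,
           frozenset({Role.POLICY_RESHAPING, Role.SUPPORT_EXPANSION})),
    Method("distillation", None, frozenset({Role.CONSOLIDATION})),
]


def methods_with_role(role: Role) -> List[str]:
    """Return the names of methods that can serve the given role, in table order."""
    return [m.name for m in TAXONOMY if role in m.roles]


if __name__ == "__main__":
    for role in Role:
        print(f"{role.name}: {methods_with_role(role)}")
```

A table like this is only a starting point for the diagnostic use the abstract mentions: given a suspected bottleneck (for example, useful behaviors that are not yet reachable), one would look up which paradigms can play the matching role and then decide how to stage them.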
