2604.13010v1 Apr 14, 2026 cs.LG

Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation

Song Han (Citations: 279, h-index: 6)

Yecheng Wu (Citations: 939, h-index: 6)

H. Cai (Citations: 68, h-index: 5)

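On-policy distillation (OPD) is emerging as an efficient post-training method for large language models. Standard OPD, however, needs a teacher inference server running live for the entire training run, which imposes a substantial infrastructure burden. In this work, we examine whether on-policy distillation can also be done offline. The intuitive approach is to compute the teacher's log-probabilities in advance over SFT (supervised fine-tuning) rollouts and reuse them during training. In practice, this offline variant did not reliably match the performance of standard OPD. To explain the gap, we identify a previously overlooked condition that matters for every OPD pipeline, which we call teacher consistency: the same teacher model must be used for both SFT and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, so that both online and offline OPD converge to a suboptimal fixed point no matter how long training runs. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing the teacher's log-probabilities over SFT rollouts. This design removes the need for a live teacher server entirely. When teacher consistency holds, Lightning OPD shares the same optimum as standard OPD, its gradient discrepancy is bounded, and it provides an implicit regularization effect that helps prevent policy drift. In extensive experiments on mathematical reasoning and code generation, Lightning OPD achieves state-of-the-art performance with markedly better efficiency than standard OPD. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, a 4.0x speedup over standard OPD. These results substantially lower the barrier to entry for academic research on LLM post-training.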

Original Abstract

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, standard OPD requires a live teacher inference server throughout training, resulting in substantial infrastructure overhead. In this work, we investigate whether on-policy distillation can be performed offline. A natural approach is to precompute teacher log-probabilities once over SFT rollouts and reuse them during training. In practice, however, this offline variant fails to reliably match the performance of standard OPD. To understand this discrepancy, we identify a previously overlooked condition that is critical for any OPD pipeline, which we term teacher consistency. This condition requires that the same teacher model be used for both supervised fine-tuning and OPD. We show that violating teacher consistency introduces an irreducible gradient bias, causing both offline and online OPD to converge to a suboptimal fixed point regardless of training duration. Building on this insight, we propose Lightning OPD, an offline on-policy distillation framework that enforces teacher consistency by precomputing teacher log-probabilities over SFT rollouts. This design eliminates the need for a live teacher server entirely. We further show that, under teacher consistency, Lightning OPD shares the same optimum as standard OPD, with bounded gradient discrepancy and an implicit regularization effect that helps prevent policy drift. Extensive experiments on mathematical reasoning and code generation demonstrate that Lightning OPD achieves state-of-the-art performance with significantly improved efficiency. Starting from an SFT-initialized Qwen3-8B-Base model, Lightning OPD reaches 69.9% on AIME 2024 in just 30 GPU hours, achieving a 4.0x speedup over standard OPD and substantially lowering the barrier to entry for academic research on LLM post-training.
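The offline idea described in the abstract (score SFT rollouts with the teacher once, then reuse the cached log-probabilities during training) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not state the exact loss, so the per-token reverse-KL estimate `log p_student - log p_teacher` is an assumption, the policies are context-free categorical distributions over three tokens, and every name below (`sample_rollouts`, `precompute_teacher_logprobs`, `offline_reverse_kl_estimate`) is hypothetical.

```python
import math
import random


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sample_rollouts(logits, num_rollouts, length, rng):
    """Sample token-id sequences from a context-free toy policy (assumption)."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [rng.choices(range(len(probs)), weights=probs, k=length)
            for _ in range(num_rollouts)]


def precompute_teacher_logprobs(rollouts, teacher_logits):
    """Offline phase: score every rollout token with the teacher exactly once."""
    teacher_lp = log_softmax(teacher_logits)
    return [[(tok, teacher_lp[tok]) for tok in seq] for seq in rollouts]


def offline_reverse_kl_estimate(cache, student_logits):
    """Training phase: average log p_student - log p_teacher over cached tokens.

    Only cached teacher values are read here; no teacher model is queried.
    """
    student_lp = log_softmax(student_logits)
    diffs = [student_lp[tok] - t_lp for seq in cache for tok, t_lp in seq]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
sft_student_logits = [0.5, 0.0, -0.5]  # toy stand-in for the SFT-initialized student
teacher_logits = [1.0, 0.0, -1.0]      # toy stand-in for the teacher

rollouts = sample_rollouts(sft_student_logits, num_rollouts=200, length=8, rng=rng)
cache = precompute_teacher_logprobs(rollouts, teacher_logits)
estimate = offline_reverse_kl_estimate(cache, sft_student_logits)
print(f"offline reverse-KL estimate: {estimate:.4f}")
```

In this sketch the teacher is touched only inside `precompute_teacher_logprobs`, which mirrors the abstract's point that no live teacher server is needed during training. The toy does not model teacher consistency itself, since it has no separate SFT stage.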

8 Citations

0 Influential

3 Altmetric

23.0 Score
