2602.03702v1 Feb 03, 2026 cs.LG

Anytime Pretraining: Horizon-Free Learning-Rate Schedules with Weight Averaging

Cengiz Pehlevan

Citations: 2,396

h-index: 28

S. Kakade

Citations: 36,410

h-index: 97

Alexandru Meterez

Harvard University

Citations: 171

h-index: 6

Depen Morwani

Citations: 468

h-index: 9

Pranav Ajit Nair

Citations: 2,874

h-index: 9

Original Abstract

Large language models are increasingly trained in continual or open-ended settings, where the total training horizon is not known in advance. Despite this, most existing pretraining recipes are not anytime: they rely on horizon-dependent learning rate schedules and extensive tuning under a fixed compute budget. In this work, we provide a theoretical analysis demonstrating the existence of anytime learning schedules for overparameterized linear regression, and we highlight the central role of weight averaging - also known as model merging - in achieving the minimax convergence rates of stochastic gradient descent. We show that these anytime schedules polynomially decay with time, with the decay rate determined by the source and capacity conditions of the problem. Empirically, we evaluate 150M and 300M parameter language models trained at 1-32x Chinchilla scale, comparing constant learning rates with weight averaging and $1/\sqrt{t}$ schedules with weight averaging against a well-tuned cosine schedule. Across the full training range, the anytime schedules achieve comparable final loss to cosine decay. Taken together, our results suggest that weight averaging combined with simple, horizon-free step sizes offers a practical and effective anytime alternative to cosine learning rate schedules for large language model pretraining.
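As a rough illustration of what "horizon-free" means here, the following minimal sketch runs SGD on a toy noisy linear regression with a $1/\sqrt{t}$ step size and a running average of the iterates. It is not the authors' implementation: the abstract does not specify the averaging scheme or hyperparameters, so the uniform running mean, the helper names (`inv_sqrt_lr`, `update_average`, `run_demo`), and all constants are assumptions chosen for illustration.

```python
import math
import random


def inv_sqrt_lr(step, base_lr=0.1):
    """Horizon-free step size: uses only the current step (1-indexed), never the total."""
    return base_lr / math.sqrt(step)


def update_average(avg, weights, count):
    """Uniform running mean of the iterates after `count` updates (assumed scheme)."""
    return [a + (w - a) / count for a, w in zip(avg, weights)]


def distance(u, v):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def run_demo(steps=2000, base_lr=0.1, noise=0.1, seed=0):
    """SGD on a toy noisy linear regression; the loop can be stopped at any step."""
    rng = random.Random(seed)
    true_w = [1.0, -2.0, 0.5]        # hypothetical ground-truth weights
    weights = [0.0] * len(true_w)    # last iterate
    avg = [0.0] * len(true_w)        # averaged iterate
    for step in range(1, steps + 1):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        y = sum(w * xi for w, xi in zip(true_w, x)) + rng.gauss(0.0, noise)
        err = sum(w * xi for w, xi in zip(weights, x)) - y
        lr = inv_sqrt_lr(step, base_lr)
        # Gradient of 0.5 * err**2 with respect to the weights is err * x.
        weights = [w - lr * err * xi for w, xi in zip(weights, x)]
        avg = update_average(avg, weights, step)
    return {
        "initial": distance([0.0] * len(true_w), true_w),
        "last": distance(weights, true_w),
        "averaged": distance(avg, true_w),
    }
```

Calling `run_demo()` returns the distance to the hypothetical `true_w` for the starting point, the last iterate, and the averaged iterate. Neither `inv_sqrt_lr` nor `update_average` takes a total step count, so training can be stopped or extended without retuning the schedule, in contrast to a cosine schedule, which needs the horizon up front.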

0 Citations

0 Influential

30 Altmetric

150.0 Score
