2602.11137v1 Feb 11, 2026 cs.LG

Weight Decay Improves Language Model Plasticity

S. Kakade

Citations: 36,410

h-index: 97

Sebastian Bordt

Citations: 1

h-index: 1

Tessa Han

Citations: 436

h-index: 4

Hanlin Zhang

Citations: 1,180

h-index: 14

Abstract

The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and casts light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
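
For readers unfamiliar with the hyperparameter under study, the following is a minimal sketch of what weight decay does to a parameter vector. It is not the authors' training setup: the abstract does not name the optimizer, so plain SGD with decoupled weight decay is assumed here, and the function names, learning rate, and decay value are hypothetical. Gradients are set to zero to isolate the decay term, which shrinks every weight by a factor of (1 - lr * weight_decay) per step.

```python
# Illustrative only: plain SGD with decoupled weight decay (an assumption,
# not the optimizer or hyperparameters used in the paper).

def sgd_step_with_weight_decay(weights, grads, lr=0.1, weight_decay=0.1):
    """Apply one update: w <- w - lr * g - lr * weight_decay * w."""
    return [w - lr * g - lr * weight_decay * w for w, g in zip(weights, grads)]


def l2_norm(weights):
    """Euclidean norm of a list of floats."""
    return sum(w * w for w in weights) ** 0.5


weights = [3.0, -4.0]      # initial L2 norm is 5.0
zero_grads = [0.0, 0.0]    # zero gradient isolates the effect of the decay term

for _ in range(10):
    weights = sgd_step_with_weight_decay(weights, zero_grads)

# Each step scales the weights by (1 - 0.1 * 0.1) = 0.99.
final_norm = l2_norm(weights)
print(final_norm)
```

A larger `weight_decay` pulls the weights toward zero faster, which is the regularization pressure whose downstream effects the abstract describes.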
