2605.06654v1 May 07, 2026 cs.LG

Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less

Jianyu Wang

Yuxing Liu

UIUC

Tong Zhang

Original Abstract

Optimizers play an important role in both pretraining and finetuning stages when training large language models (LLMs). In this paper, we present an observation that full finetuning with the same optimizer as in pretraining achieves a better learning-forgetting tradeoff, i.e., forgetting less while achieving the same or better performance on the new task, than other optimizers and, possibly surprisingly, LoRA, during the supervised finetuning (SFT) stage. We term this phenomenon optimizer-model consistency. To better understand it, through controlled experiments and theoretical analysis, we show that: 1) optimizers can shape the models by having regularization effects on the activations, leading to different landscapes around the pretrained checkpoints; 2) in response to this regularization effect, the weight update in SFT should follow some specific structures to lower forgetting of the knowledge learned in pretraining, which can be obtained by using the same optimizer. Moreover, we specifically compare Muon and AdamW when they are employed throughout the pretraining and SFT stages and find that Muon performs worse when finetuned for reasoning tasks. With a synthetic language modeling experiment, we demonstrate that this can come from Muon's strong tendency towards rote memorization, which may hurt pattern acquisition with a small amount of data, as for SFT.
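The abstract states that weight updates in SFT should follow specific structures, and that these come from reusing the pretraining optimizer. As a rough illustration of how update structure can differ between optimizers, the sketch below compares an AdamW-style direction with a Muon-style direction for the same gradient. It is not the paper's experimental setup. The function names and the 2x2 gradient are hypothetical. The Muon-style step uses the classic cubic Newton-Schulz iteration as a stand-in for Muon's tuned orthogonalization, applied to the raw gradient rather than a momentum buffer. Learning rate and weight decay are omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def adam_style_first_step(grad, eps=1e-8):
    """Direction of the first bias-corrected Adam step: g / (|g| + eps).

    Learning rate and weight decay are omitted, so this is only the
    elementwise-normalized direction, roughly sign(g).
    """
    return [[g / (abs(g) + eps) for g in row] for row in grad]


def muon_style_direction(grad, steps=20):
    """Approximate the orthogonal polar factor of `grad`.

    Assumption: uses the classic cubic Newton-Schulz iteration
    X <- 1.5 X - 0.5 X X^T X, not Muon's tuned polynomial, and takes the
    raw gradient in place of a momentum buffer.
    """
    fro = math.sqrt(sum(v * v for row in grad for v in row))
    x = [[v / fro for v in row] for row in grad]  # singular values now <= 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)] for rp, rq in zip(x, xxtx)]
    return x


# Hypothetical 2x2 gradient of a weight matrix, chosen only for illustration.
grad = [[3.0, 1.0], [0.5, -2.0]]

adam_update = adam_style_first_step(grad)
muon_update = muon_style_direction(grad)

# Every entry of the Adam-style direction has magnitude close to 1.
print(adam_update)
# The Muon-style direction is close to orthogonal: U U^T is near the identity.
print(matmul(muon_update, transpose(muon_update)))
```

The two directions are normalized in different ways. The Adam-style step equalizes entry magnitudes, while the Muon-style step equalizes singular values. This is one concrete sense in which a finetuning update can carry optimizer-specific structure.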
