2604.27295v1 Apr 30, 2026 cs.AI

Learning Rate Engineering: From Coarse Single Parameter to Layered Evolution

Di Wang, Mingshuai Yao, Jianqiao Cui, Jin-Yan Chen, Zihui Cui, Fa Wang, Chen Wei, Qiuxia Yu

Original Abstract

Learning rate scheduling has evolved from the single global fixed rate of early SGD to sophisticated layer-wise adaptive strategies. We systematize this evolution into five generations: (Gen1) global fixed learning rates, (Gen2) global scheduling, (Gen3) parameter-level adaptation, (Gen4) layer-level differentiation, and (Gen5) joint layer-time scheduling. We trace the fundamental motivation behind each transition, showing how the shift from one-size-fits-all to tailoring by layer and time addresses the impossible trinity of transfer learning: lower layers require small updates to preserve general knowledge while higher layers need large updates to adapt to new tasks. Building on this taxonomy, we propose Discriminative Adaptive Layer Scaling (DALS), a unified framework that integrates phase-adaptive cosine scheduling, depth-aware Grokfast gradient filtering, and LARS-style trust ratios into a single coherent optimizer. We benchmark 18 strategies, including three DALS variants, across all five generations on five datasets: synthetic, CIFAR-10 (from scratch), RTE, TREC-6, and IMDb (fine-tuning). On synthetic, DALS achieves the best accuracy at 98.0%, while DALS-Fast reaches 90% in just 3 epochs. The cross-dataset analysis reveals striking regime-dependent patterns: no single strategy wins across all regimes. Critically, STLR+Discriminative, the ULMFiT champion, catastrophically fails on from-scratch tasks (43.6% on TREC-6 from scratch vs. 96.8% with RAdam), confirming that directional decay biases are harmful without pretrained features. DALS avoids either extreme, achieving the best synthetic result while maintaining competitive fine-tuning performance.
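
The abstract names several ingredients without giving formulas, so the following is only a minimal sketch of what each one commonly looks like in isolation. It is not the authors' DALS optimizer. The function names, the default values (a per-layer decay of 2.6 as used in ULMFiT, and `alpha=0.98`, `lamb=2.0` for the EMA filter), and in particular the multiplicative combination in `layered_lr` are assumptions made for illustration. The "phase-adaptive" and "depth-aware" modifications the abstract mentions are not modeled here.

```python
import math


def cosine_lr(step, total_steps, base_lr, min_lr=0.0):
    """Global cosine annealing (a Gen2-style schedule)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def layer_scale(layer_idx, num_layers, decay=2.6):
    """Discriminative (Gen4-style) scale.

    The top layer (layer_idx == num_layers - 1) gets 1.0; each layer
    below it is divided by `decay` once more, so lower layers move less.
    """
    return decay ** -(num_layers - 1 - layer_idx)


def trust_ratio(weights, grads, weight_decay=0.0):
    """LARS-style ratio ||w|| / (||g|| + weight_decay * ||w||).

    Falls back to 1.0 when either norm is zero, so a freshly
    zero-initialized layer is not frozen.
    """
    w_norm = math.sqrt(sum(w * w for w in weights))
    g_norm = math.sqrt(sum(g * g for g in grads))
    if w_norm == 0.0 or g_norm == 0.0:
        return 1.0
    return w_norm / (g_norm + weight_decay * w_norm)


def grokfast_ema(grads, ema, alpha=0.98, lamb=2.0):
    """Plain EMA gradient filter in the style of Grokfast.

    Keeps a running average of past gradients (the slow component)
    and adds it back, amplified by `lamb`, to the current gradient.
    Returns (filtered_grads, new_ema).
    """
    new_ema = [alpha * e + (1.0 - alpha) * g for e, g in zip(ema, grads)]
    filtered = [g + lamb * e for g, e in zip(grads, new_ema)]
    return filtered, new_ema


def layered_lr(step, total_steps, base_lr, layer_idx, num_layers, weights, grads):
    """Hypothetical Gen5-style rate: time schedule x layer scale x trust ratio.

    Multiplying the three factors is an assumption of this sketch,
    not a rule stated in the abstract.
    """
    return (
        cosine_lr(step, total_steps, base_lr)
        * layer_scale(layer_idx, num_layers)
        * trust_ratio(weights, grads)
    )
```

In this sketch the bottom layer of a three-layer model is scaled by 2.6^-2 relative to the top layer. This is the kind of directional decay bias that, according to the abstract, helps fine-tuning but hurts when training from scratch.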
