2604.22407v1 Apr 24, 2026 cs.LG

Hidden Failure Modes of Gradient Modification under Adam in Continual Learning, and Adaptive Decoupled Moment Routing as a Repair

Yuelin Hu (citations: 5, h-index: 2)
Zhengxue Cheng (citations: 263, h-index: 9)
Li Song (citations: 71, h-index: 3)
Wei Liu (citations: 12, h-index: 3)
Zhenbo Yu (citations: 25, h-index: 3)

Abstract

Many continual-learning methods modify gradients upstream (e.g., projection, penalty rescaling, replay mixing) while treating Adam as a neutral backend. We show this composition has a hidden failure mode. In a high-overlap, non-adaptive 8-domain continual LM, all shared-routing projection baselines collapse close to vanilla forgetting (12.5--12.8 vs. 13.2). A 0.5% replay buffer is the strongest shared alternative but still reaches 11.6, while fixed-strength decoupling falls below vanilla at 14.1. Only adaptive decoupled routing remains stable at 9.4, improving over vanilla by 3.8 units. On a 16-domain stream, its gain over the strongest shared-routing projection baseline grows to 4.5--4.8 units. The failure is largely invisible on clean benchmarks. We explain this effect through Adam's second-moment pathway: in the tested regime, projection induces a 1/(1-alpha) inflation of the old-direction effective learning rate, matching measurements within 8% across eight alpha values. The same conflict appears with penalty methods, replay mixing, and at 7B scale under LoRA. Our fix routes the modified gradient only to the first moment while preserving magnitude-faithful second-moment statistics, with overlap-aware adaptive strength. This simple change is the only tested configuration that consistently avoids collapse across methods, optimizers, and scale.
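The second-moment argument in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single parameter coordinate aligned with the old-task direction, a soft projection that scales that component by (1 - alpha), and a synthetic Gaussian gradient stream. The function name `adam_effective_lr` and all hyperparameter values are hypothetical, and the overlap-aware adaptive strength mentioned in the abstract is left out.

```python
import math
import random


def adam_effective_lr(raw_grads, alpha, decoupled,
                      lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run Adam's moment updates on one coordinate.

    Returns (effective learning rate, size of the final update step).
    Assumption: the old-task direction is axis-aligned, so a soft
    projection with strength alpha scales this coordinate by (1 - alpha).
    """
    m = 0.0
    v = 0.0
    t = 0
    for t, g in enumerate(raw_grads, start=1):
        g_mod = (1.0 - alpha) * g              # gradient after the upstream modification
        m = beta1 * m + (1.0 - beta1) * g_mod  # first moment always sees the modified gradient
        # Shared routing feeds the modified gradient to the second moment too;
        # decoupled routing keeps the raw gradient's magnitude there.
        v_in = g if decoupled else g_mod
        v = beta2 * v + (1.0 - beta2) * v_in * v_in
    m_hat = m / (1.0 - beta1 ** t)             # standard Adam bias correction
    v_hat = v / (1.0 - beta2 ** t)
    eff_lr = lr / (math.sqrt(v_hat) + eps)     # per-coordinate effective learning rate
    return eff_lr, eff_lr * m_hat


rng = random.Random(0)
raw = [rng.gauss(0.5, 1.0) for _ in range(2000)]  # synthetic old-direction gradients

base_lr, base_step = adam_effective_lr(raw, alpha=0.0, decoupled=False)
for alpha in (0.25, 0.5, 0.75):
    shared_lr, shared_step = adam_effective_lr(raw, alpha, decoupled=False)
    dec_lr, dec_step = adam_effective_lr(raw, alpha, decoupled=True)
    print(f"alpha={alpha:.2f}  "
          f"shared lr ratio={shared_lr / base_lr:.3f}  "
          f"shared step ratio={shared_step / base_step:.3f}  "
          f"decoupled step ratio={dec_step / base_step:.3f}")
```

Under these assumptions, shared routing shrinks the second moment by (1 - alpha)^2, so the effective learning rate grows by about 1/(1 - alpha) and the final step is nearly the same size as with no projection at all. Decoupled routing leaves the second moment untouched, so the step shrinks by the intended factor of (1 - alpha).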
