2602.14462v2 Feb 16, 2026 cs.LG

Silent Inconsistency in Data-Parallel Full Fine-Tuning: Diagnosing Worker-Level Optimization Misalignment

Xinyu Wang

Hong Li

Zhenghao Zhou

Honggang Zhang

Yu Luo

Han Gong

Zhiyuan Liu

Abstract

Data-parallel (DP) training with synchronous all-reduce is a dominant paradigm for full-parameter fine-tuning of large language models (LLMs). While parameter synchronization guarantees numerical equivalence of model weights after each iteration, it does not necessarily imply alignment of worker-level optimization dynamics before gradient aggregation. This paper identifies and studies this latent mismatch, termed "silent inconsistency", where cross-worker divergence in losses and gradients can remain invisible under conventional aggregated monitoring signals. We propose a lightweight, model-agnostic diagnostic framework that quantifies worker-level consistency using training signals readily available in standard pipelines. Specifically, we introduce three complementary metrics: loss dispersion, gradient-norm dispersion, and gradient-direction consistency measured by inter-worker cosine similarity. The proposed metrics incur negligible overhead and require no modification to model architecture, synchronization mechanisms, or optimization algorithms. We validate the framework by fully fine-tuning the 1B-parameter openPangu-Embedded-1B-V1.1 model on the tatsu-lab/alpaca dataset using an 8-NPU DP setup, under controlled perturbations of cross-rank stochasticity. Experimental results show that progressively desynchronized data shuffling and random seeds lead to substantial increases in loss/gradient dispersion and reduced directional alignment, despite smooth globally averaged loss curves. These findings demonstrate that the proposed indicators provide actionable visibility into hidden instability modes in large-scale DP fine-tuning, enabling more reliable diagnosis and configuration assessment.
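
The three metrics named in the abstract can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulas, so the definitions here are assumptions: dispersion is taken as the population standard deviation across workers, and direction consistency as the mean cosine similarity over all distinct worker pairs. The function name `worker_consistency` and its inputs (per-worker losses and flattened per-worker gradients, collected before the all-reduce) are hypothetical and not taken from the paper.

```python
import numpy as np


def worker_consistency(losses, grads):
    """Per-step worker-level consistency indicators (illustrative definitions).

    losses: sequence of W per-worker scalar losses (W >= 2).
    grads:  array-like of shape (W, D), one flattened local gradient per
            worker, captured before gradient aggregation.
    """
    losses = np.asarray(losses, dtype=np.float64)
    grads = np.asarray(grads, dtype=np.float64)

    # Per-worker gradient L2 norms.
    norms = np.linalg.norm(grads, axis=1)

    # Assumption: "dispersion" = population standard deviation across workers.
    loss_dispersion = losses.std()
    grad_norm_dispersion = norms.std()

    # Normalize each gradient; guard against division by a zero norm.
    unit = grads / np.maximum(norms, 1e-12)[:, None]

    # Pairwise cosine similarities, averaged over distinct worker pairs only
    # (upper triangle, excluding the diagonal of self-similarities).
    cos = unit @ unit.T
    pairs = np.triu_indices(grads.shape[0], k=1)
    direction_consistency = cos[pairs].mean()

    return {
        "loss_dispersion": float(loss_dispersion),
        "grad_norm_dispersion": float(grad_norm_dispersion),
        "grad_direction_consistency": float(direction_consistency),
    }


if __name__ == "__main__":
    # Two toy steps with the same mean loss (1.0) but different
    # worker-level agreement; the values are made up for illustration.
    aligned = worker_consistency([1.0, 1.0], [[1.0, 0.0], [1.0, 0.0]])
    divergent = worker_consistency([0.5, 1.5], [[1.0, 0.0], [0.0, 1.0]])
    print(aligned)
    print(divergent)
```

Both toy steps report the same averaged loss, which is the situation the abstract describes: the aggregate signal looks identical while the per-worker indicators differ. In a real DP run the per-rank losses and gradients would have to be gathered across ranks before calling such a function; that step is outside this sketch.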
