2601.10684v1 Jan 15, 2026 cs.LG

On the origin of neural scaling laws: from random graphs to natural language

M. Barkeshli

Citations: 3,799

h-index: 33

Alberto Alfarano

Citations: 3

h-index: 1

Andrey Gromov

Citations: 333

h-index: 5

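Scaling laws play an important role in the modern AI revolution, giving predictive power over how model performance improves as data, compute, and the number of model parameters increase. As interest in the origin of these scaling laws has grown, a common suggestion is that they arise from power law structure present in the data. This paper studies scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. It shows that neural scaling laws appear even in this simplified setting, and even when the data correlations have no power law structure. The authors also train on natural language data of progressively reduced complexity, sampled from 4-, 2-, and 1-layer transformer language models down to language bigrams, and observe a monotonic change in the scaling exponents. They further analyze scaling laws obtained from training on random walks on random graphs drawn from the Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, they revisit conventional language modeling scaling laws, showing that several important results can be reproduced with 2-layer transformers and a context length of 50. They also critically analyze the fitting methods used in prior literature, present an alternative method for obtaining compute-optimal curves compared with current practice in the published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.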
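As a rough illustration of what fitting a scaling exponent involves, the sketch below fits a pure power law y = a * x^(-alpha) by least squares in log-log space. This is a minimal sketch under stated assumptions: the functional form, the function name `fit_power_law`, and the data points are hypothetical and synthetic, not taken from the paper, and fits used in practice often include extra terms such as an irreducible-loss offset.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of log(y) = log(a) - alpha * log(x).

    Returns (a, alpha). Assumes all xs and ys are strictly positive.
    Hypothetical helper for illustration; not the paper's fitting method.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Ordinary least-squares slope and intercept on the log-log points.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic data generated from a known power law (NOT results from the paper).
sizes = [1e5, 1e6, 1e7, 1e8]
losses = [3.0 * n ** -0.25 for n in sizes]
a_hat, alpha_hat = fit_power_law(sizes, losses)
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating parameters up to floating-point error. Real loss curves are noisier, and the choice of fitting form can change the estimated exponent.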

Original Abstract

Scaling laws have played a major role in the modern AI revolution, providing practitioners predictive power over how model performance will improve with increasing data, compute, and number of model parameters. This has spurred an intense interest in the origin of neural scaling laws, with a common suggestion being that they arise from power law structure already present in the data. In this paper we study scaling laws for transformers trained to predict random walks (bigrams) on graphs with tunable complexity. We demonstrate that this simplified setting already gives rise to neural scaling laws even in the absence of power law structure in the data correlations. We further consider dialing down the complexity of natural language systematically, by training on sequences sampled from increasingly simplified generative language models, from 4-, 2-, and 1-layer transformer language models down to language bigrams, revealing a monotonic evolution of the scaling exponents. Our results also include scaling laws obtained from training on random walks on random graphs drawn from Erdős-Rényi and scale-free Barabási-Albert ensembles. Finally, we revisit conventional scaling laws for language modeling, demonstrating that several essential results can be reproduced using 2-layer transformers with context length of 50, provide a critical analysis of various fits used in prior literature, demonstrate an alternative method for obtaining compute-optimal curves as compared with current practice in published literature, and provide preliminary evidence that maximal update parameterization may be more parameter efficient than standard parameterization.
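To make the random-walk setting concrete, the sketch below samples an undirected Erdős-Rényi G(n, p) graph and draws a uniform random walk on it. Such a walk is a sequence in which each token depends only on the previous one, which is the bigram structure the abstract refers to. This is a minimal sketch under stated assumptions: the function names, the graph size, the edge probability, and the walk length are hypothetical choices for illustration, and the paper's actual data pipeline may differ.

```python
import random


def erdos_renyi_graph(n, p, rng):
    """Sample an undirected G(n, p) graph as a dict of adjacency lists."""
    adj = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            # Each possible edge is included independently with probability p.
            if rng.random() < p:
                adj[i].append(j)
                adj[j].append(i)
    return adj


def random_walk(adj, length, rng):
    """Draw a uniform random walk of `length` nodes.

    The walk starts at a node with at least one neighbor. Because the graph
    is undirected, every node reached afterwards also has a neighbor.
    """
    starts = [v for v, nbrs in adj.items() if nbrs]
    if not starts:
        raise ValueError("graph has no edges")
    node = rng.choice(starts)
    walk = [node]
    for _ in range(length - 1):
        node = rng.choice(adj[node])  # next node depends only on the current one
        walk.append(node)
    return walk


# Hypothetical sizes, chosen only for illustration.
rng = random.Random(0)
graph = erdos_renyi_graph(n=50, p=0.2, rng=rng)
walk = random_walk(graph, length=50, rng=rng)
```

Each node index can be treated as a token, so a collection of such walks forms a training corpus whose only structure is the graph's transition rule.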

3 Citations

0 Influential

16.5 Altmetric

85.5 Score
