2112.11446 Dec 08, 2021 cs.AI

Scaling Language Models: Methods, Analysis & Insights from Training Gopher

O. Vinyals

Citations: 260,727

h-index: 103

Amelia Glaese

Citations: 4,961

h-index: 9

Nat McAleese

Citations: 4,443

h-index: 10

John Aslanides

Citations: 7,754

h-index: 14

Maribeth Rauh

Citations: 5,217

h-index: 13

Laura Weidinger

Citations: 5,808

h-index: 19

Jonathan Uesato

Citations: 14,090

h-index: 26

Po-Sen Huang

Citations: 12,066

h-index: 31

Sumanth Dathathri

Citations: 4,466

h-index: 14

Doug Fritz

Citations: 2,592

h-index: 3

Susannah Young

Citations: 3,042

h-index: 7

Iason Gabriel

Citations: 7,787

h-index: 24

William S. Isaac

Citations: 6,130

h-index: 14

John F. J. Mellor

Citations: 5,170

h-index: 10

D. Hassabis

Citations: 188,933

h-index: 91

K. Kavukcuoglu

Citations: 230,975

h-index: 76

Lisa Anne Hendricks

Citations: 21,555

h-index: 32

G. Irving

Citations: 44,396

h-index: 22

Jack W. Rae

Citations: 14,540

h-index: 24

Sebastian Borgeaud

Citations: 28,861

h-index: 20

Trevor Cai

Citations: 12,293

h-index: 9

Katie Millican

Citations: 21,670

h-index: 10

Jordan Hoffmann

Citations: 8,780

h-index: 15

Francis Song

Citations: 3,759

h-index: 9

Sarah Henderson

Citations: 1,842

h-index: 3

Roman Ring

Citations: 20,330

h-index: 9

Eliza Rutherford

Citations: 18,761

h-index: 9

T. Hennigan

Citations: 14,219

h-index: 9

Jacob Menick

Citations: 38,050

h-index: 9

Albin Cassirer

Citations: 10,428

h-index: 9

Richard Powell

Citations: 12,067

h-index: 7

George van den Driessche

Citations: 43,218

h-index: 14

Johannes Welbl

Citations: 13,993

h-index: 19

Saffron Huang

Citations: 4,471

h-index: 7

I. Higgins

Citations: 12,317

h-index: 21

Antonia Creswell

Citations: 8,160

h-index: 19

Amy Wu

Citations: 1,752

h-index: 2

Erich Elsen

Citations: 20,727

h-index: 26

Siddhant M. Jayakumar

Citations: 4,108

h-index: 14

Elena Buchatskaya

Citations: 24,308

h-index: 12

D. Budden

Citations: 11,092

h-index: 27

Esme Sutherland

Citations: 1,584

h-index: 1

K. Simonyan

Citations: 214,422

h-index: 64

Michela Paganini

Citations: 9,472

h-index: 40

L. Sifre

Citations: 58,228

h-index: 28

Xiang Lorraine Li

UMASS Amherst

Citations: 3,489

h-index: 19

A. Kuncoro

Citations: 4,083

h-index: 18

Aida Nematzadeh

Citations: 8,150

h-index: 14

E. Gribovskaya

Citations: 9,357

h-index: 17

Domenic Donato

DeepMind

Citations: 1,686

h-index: 7

Angeliki Lazaridou

Citations: 11,946

h-index: 35

Arthur Mensch

Citations: 18,092

h-index: 13

Jean-Baptiste Lespiau

Citations: 12,712

h-index: 18

M. Tsimpoukelli

Citations: 18,018

h-index: 8

N. Grigorev

Citations: 1,585

h-index: 2

Thibault Sottiaux

Citations: 5,102

h-index: 4

Mantas Pajarskas

Citations: 7,756

h-index: 5

Tobias Pohlen

Citations: 7,528

h-index: 8

Z. Gong

Citations: 1,685

h-index: 7

Daniel Toyama

Citations: 8,323

h-index: 11

Cyprien de Masson d'Autume

Citations: 3,246

h-index: 13

Yujia Li

Citations: 19,128

h-index: 27

Tayfun Terzi

Citations: 4,726

h-index: 6

Vladimir Mikulik

DeepMind

Citations: 4,346

h-index: 17

Igor Babuschkin

Citations: 9,639

h-index: 9

Aidan Clark

Citations: 9,313

h-index: 14

Diego de Las Casas

Citations: 17,205

h-index: 13

Aurelia Guy

Citations: 7,686

h-index: 9

Chris Jones

Citations: 3,480

h-index: 4

James Bradbury

Citations: 64,781

h-index: 11

Matthew G. Johnson

Citations: 1,848

h-index: 2

Blake A. Hechtman

Citations: 6,179

h-index: 16

Edward Lockhart

Citations: 7,200

h-index: 14

Simon Osindero

Citations: 46,713

h-index: 36

Laura Rimell

Citations: 4,999

h-index: 21

Chris Dyer

Citations: 43,560

h-index: 77

Kareem W. Ayoub

Citations: 8,006

h-index: 8

J. Stanway

Citations: 10,768

h-index: 10

L. Bennett

Citations: 1,599

h-index: 2

L. Martens

Citations: 1,630

h-index: 4

Original Abstract

Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world. In this paper, we present an analysis of Transformer-based language model performance across a wide range of model scales -- from models with tens of millions of parameters up to a 280 billion parameter model called Gopher. These models are evaluated on 152 diverse tasks, achieving state-of-the-art performance across the majority. Gains from scale are largest in areas such as reading comprehension, fact-checking, and the identification of toxic language, but logical and mathematical reasoning see less benefit. We provide a holistic analysis of the training dataset and model's behaviour, covering the intersection of model scale with bias and toxicity. Finally we discuss the application of language models to AI safety and the mitigation of downstream harms.

1587 Citations

96 Influential

30 Altmetric

1,929.0 Score

AI Analysis

Summary

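This paper introduces Gopher, a 280-billion-parameter (280B) large language model developed by DeepMind, and analyzes from several angles how scaling model size affects performance. Across 152 diverse benchmark tasks, Gopher outperformed prior state-of-the-art (SOTA) models on knowledge-intensive tasks such as reading comprehension, fact-checking, and toxic language identification, while the gains from scale were comparatively small for logical and mathematical reasoning. The paper also covers the construction of the high-quality MassiveText dataset, an analysis of the model's toxicity and bias, and the techniques tried for efficient training and inference of large models, along with their limits.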

Key Innovations

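The 280-billion-parameter (280B) Gopher model architecture, using RMSNorm and relative positional encodings
Construction of MassiveText, a 10.5 TB high-quality dataset spanning web pages, books, news, and code, with a careful filtering pipeline
An empirical analysis showing that the gains from model scale differ by task type (knowledge versus reasoning)
An analysis of dialogue prompting, covering chatbot behaviour and its effect on mitigating toxicity
An efficient parallel training strategy on large-scale TPUv3 infrastructure, combining data, model, and pipeline parallelism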
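To make the RMSNorm item above concrete, here is a minimal sketch of root-mean-square normalization over a single feature vector. It is an illustration in plain Python, not the Gopher implementation: the function name `rms_norm`, the `gain` argument, and the `eps` default are assumptions chosen for this example.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply a per-feature gain.

    Unlike LayerNorm, there is no mean subtraction and no bias term.
    `eps` (an assumed default) guards against division by zero.
    """
    if len(x) != len(gain):
        raise ValueError("x and gain must have the same length")
    mean_square = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [g * v * inv_rms for v, g in zip(x, gain)]


if __name__ == "__main__":
    # With a unit gain, the output has a root mean square close to 1.
    print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```

In a real model the gain would be a learned parameter and the operation would run over the last axis of a batched tensor; the list-based version here only shows the arithmetic.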

Learning & Inference Impact

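On the training side, the authors combined data parallelism with model parallelism and used optimizer state partitioning (ZeRO) and activation recomputation (rematerialization) to improve memory efficiency. In particular, relative positional encodings let the model handle longer contexts at inference time than it saw during training. On the inference side, they tried several compression techniques, including distillation, pruning, and sparse training, but found limits to how far model size can be reduced while preserving general language modelling performance. This suggests that future work should explore new architectures, such as retrieval-based ones, rather than compression alone.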
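The memory argument for partitioning optimizer state can be sketched with back-of-the-envelope arithmetic. The numbers below are assumptions for illustration, not figures from this page: 2 bytes per parameter (as for bfloat16 weights), 8 bytes of optimizer state per parameter (as for two float32 Adam moments), parameters replicated on every device, and gradients and activations ignored. The function names and the 1.4B-parameter, 8-device configuration are hypothetical.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def state_bytes(num_params, param_bytes=2, optimizer_bytes=8):
    """Return (parameter bytes, optimizer-state bytes) under the assumed byte widths."""
    return num_params * param_bytes, num_params * optimizer_bytes


def per_device_gib(num_params, num_devices, partition_optimizer=True):
    """Approximate per-device memory in GiB.

    Parameters are assumed to be replicated on every device. Optimizer state is
    either replicated too, or split evenly across devices when partitioned.
    """
    params, opt = state_bytes(num_params)
    if partition_optimizer:
        opt = opt / num_devices
    return (params + opt) / GIB


if __name__ == "__main__":
    params, opt = state_bytes(280_000_000_000)
    print(f"280B model, params + optimizer state: {(params + opt) / TIB:.2f} TiB")
    for partitioned in (False, True):
        gib = per_device_gib(1_400_000_000, 8, partition_optimizer=partitioned)
        print(f"1.4B model on 8 devices, partitioned={partitioned}: {gib:.2f} GiB per device")
```

Under these assumptions the optimizer state dominates the footprint, which is why splitting it across devices gives a larger saving than any change to how the parameters themselves are stored.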

Technical Difficulty

Advanced

Estimated implementation complexity based on methodology.
