2208.01448 Aug 02, 2022 cs.AI

AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

Saleh Soltan

Citations: 1,521

h-index: 21

Shankar Ananthakrishnan

Citations: 388

h-index: 7

Jack G. M. FitzGerald

Alexa AI

Citations: 428

h-index: 7

Rahul Gupta

Department of Electrical Engineering, University of Southern California, Los Angeles, California

Citations: 987

h-index: 20

Wael Hamza

Amazon Alexa

Citations: 2,330

h-index: 21

Haidar Khan

Citations: 704

h-index: 10

Charith Peris

Amazon

Citations: 690

h-index: 11

Stephen Rawls

Citations: 974

h-index: 13

Andrew Rosenbaum

Citations: 547

h-index: 8

Anna Rumshisky

Citations: 6,720

h-index: 35

C. Prakash

Citations: 204

h-index: 5

Mukund Sridhar

Citations: 211

h-index: 6

Fabian Triefenbach

Citations: 571

h-index: 10

Apurv Verma

NJIT, Bloomberg, Amazon, Georgia Institute of Technology

Citations: 320

h-index: 8

Gokhan Tur

University of Illinois at Urbana Champaign

Citations: 3,458

h-index: 22

Premkumar Natarajan

Citations: 591

h-index: 10

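In this work, we demonstrate that large-scale multilingual sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and causal language modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models across a variety of tasks. Specifically, we train a 20-billion-parameter multilingual seq2seq model called the Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization, outperforming the much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation on the Flores-101 dataset, especially for low-resource languages, across almost all language pairs the model supports (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu). We further show that, in the zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on the SuperGLUE and SQuADv2 datasets and delivers SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results make a compelling case that seq2seq models can be a powerful alternative to decoder-only models for training large language models (LLMs).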

Original Abstract

In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.

90 Citations

12 Influential

17.5 Altmetric

201.5 Score

AI Analysis

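Summary

This paper introduces AlexaTM 20B, a 20-billion-parameter (20B) multilingual sequence-to-sequence (seq2seq) model developed by Amazon Alexa AI. The model is pre-trained on a mixture of denoising and causal language modeling (CLM) and supports 12 languages. The authors show that it matches or outperforms much larger decoder-only models (e.g., PaLM 540B and GPT-3 175B) on tasks such as 1-shot summarization and machine translation. The results suggest that the seq2seq architecture can be a strong and efficient alternative to decoder-only models for few-shot learning and large language model (LLM) training.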

Key Innovations

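Mixed pre-training strategy that combines denoising and causal language modeling (CLM) at an 80:20 ratio
1-shot summarization that outperforms the 540B-parameter PaLM model with only 20B parameters
Encoder-decoder architecture with bidirectional attention in the encoder, which strengthens long-context handling
SOTA 1-shot machine translation (MT) across 12 languages, including low-resource languages
Support for efficiently encoding few-shot examples with the Fusion-in-Decoder (FiD) technique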
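
The Fusion-in-Decoder idea in the last item can be illustrated with a toy sketch. This is not the AlexaTM implementation: `toy_encode`, `fid_encode`, `self_attention_pairs`, and the `=>` and `|` separators are all hypothetical names chosen for illustration, and pairing each shot with the query is an assumption about how FiD would be applied to few-shot prompts. The sketch only shows the general pattern: each passage is encoded on its own, and the resulting encoder states are concatenated so that a decoder could cross-attend over all of them.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Shot = Tuple[Sequence[str], Sequence[str]]  # (demonstration input, demonstration output)


def toy_encode(tokens: Sequence[str], dim: int = 4) -> List[Vector]:
    """Stand-in for a real encoder: one deterministic vector per token."""
    return [
        [((sum(map(ord, tok)) * (k + 1)) % 7) / 7.0 for k in range(dim)]
        for tok in tokens
    ]


def fid_encode(
    shots: Sequence[Shot],
    query: Sequence[str],
    encode: Callable[[Sequence[str]], List[Vector]] = toy_encode,
) -> Tuple[List[Vector], List[int]]:
    """Encode each (shot, query) passage independently, then concatenate.

    Returns the fused encoder states and the length of each passage.
    """
    passages = [list(x) + ["=>"] + list(y) + ["|"] + list(query) for x, y in shots]
    fused: List[Vector] = []
    for passage in passages:
        # Encoder self-attention would only see tokens of this one passage.
        fused.extend(encode(passage))
    return fused, [len(p) for p in passages]


def self_attention_pairs(lengths: Sequence[int], independent: bool) -> int:
    """Count token pairs the encoder self-attention has to score."""
    if independent:
        return sum(n * n for n in lengths)  # one attention matrix per passage
    total = sum(lengths)
    return total * total  # one attention matrix over everything at once


if __name__ == "__main__":
    shots = [("good movie".split(), ["positive"]), ("bad plot".split(), ["negative"])]
    states, lengths = fid_encode(shots, "great cast".split())
    print(len(states), lengths)
    print(self_attention_pairs(lengths, independent=True),
          self_attention_pairs(lengths, independent=False))
```

Because each passage is encoded separately, the encoder's self-attention cost grows linearly with the number of shots rather than quadratically with the total prompt length, which is why this pattern makes it cheaper to supply more examples.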

Learning & Inference Impact

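On the training side, the model is trained on a mixture of span-corruption-based denoising and CLM so that the encoder develops bidirectional context understanding while the decoder develops generation ability. No mask tokens are placed in the input, which reduces the mismatch between training and inference, and either a "denoising mode" or a "CLM mode" can be selected at inference time depending on the task. The encoder-decoder architecture is also better suited to long contexts than a decoder-only model, and Fusion-in-Decoder makes it possible to feed more few-shot examples to the encoder, which can improve inference quality.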
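
The mixed pre-training objective can be sketched as follows. Only the 80:20 split between denoising and CLM and the absence of mask tokens come from this page. The 15% drop rate, the mean span length of 3, the 20% to 80% split point for CLM, the `[CLM]` marker name, and the choice of the full original text as the denoising target are illustrative assumptions, and all function names are hypothetical.

```python
import random
from typing import Dict, List

CLM_TOKEN = "[CLM]"  # hypothetical marker that switches the model into CLM mode


def make_denoising_example(tokens: List[str], rng: random.Random,
                           drop_rate: float = 0.15, mean_span: int = 3) -> Dict:
    """Drop random spans from the input without inserting mask tokens.

    Assumption: the target is the full original sequence.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    n_to_drop = max(1, round(n * drop_rate))
    dropped = set()
    while len(dropped) < n_to_drop:
        start = rng.randrange(n)
        length = max(1, round(rng.expovariate(1 / mean_span)))
        for i in range(start, min(n, start + length)):
            if len(dropped) < n_to_drop:
                dropped.add(i)
    source = [t for i, t in enumerate(tokens) if i not in dropped]
    return {"mode": "denoising", "source": source, "target": list(tokens)}


def make_clm_example(tokens: List[str], rng: random.Random,
                     min_frac: float = 0.2, max_frac: float = 0.8) -> Dict:
    """Give the encoder a prefix and ask the decoder to continue the text."""
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens")
    split = int(n * rng.uniform(min_frac, max_frac))
    split = min(max(split, 1), n - 1)  # keep both sides non-empty
    return {"mode": "clm",
            "source": [CLM_TOKEN] + tokens[:split],
            "target": tokens[split:]}


def make_pretraining_example(tokens: List[str], rng: random.Random,
                             denoising_prob: float = 0.8) -> Dict:
    """Pick denoising with probability 0.8 and CLM otherwise (80:20 mix)."""
    if rng.random() < denoising_prob:
        return make_denoising_example(tokens, rng)
    return make_clm_example(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "the quick brown fox jumps over the lazy dog near the river bank".split()
    example = make_pretraining_example(doc, rng)
    print(example["mode"], example["source"], example["target"])
```

In this sketch the two modes differ only in how the source is built, so the same encoder-decoder model can be prompted in either mode at inference time by including or omitting the marker token.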

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
