2305.02309 May 03, 2023 cs.AI

CodeGen2: Lessons for Training LLMs on Programming and Natural Languages

Erik Nijkamp

Citations: 4,973

h-index: 20

Hiroaki Hayashi

Carnegie Mellon University

Citations: 8,063

h-index: 14

Caiming Xiong

Citations: 18,532

h-index: 39

S. Savarese

Citations: 80,550

h-index: 115

Yingbo Zhou

Citations: 3,989

h-index: 25

Original Abstract

Large language models (LLMs) have demonstrated remarkable abilities in representation learning for program synthesis and understanding tasks. The quality of the learned representations appears to be dictated by the neural scaling laws as a function of the number of model parameters and observations, while imposing upper bounds on the model performance by the amount of available data and compute, which is costly. In this study, we attempt to render the training of LLMs for program synthesis more efficient by unifying four key components: (1) model architectures, (2) learning methods, (3) infill sampling, and (4) data distributions. Specifically, for the model architecture, we attempt to unify encoder- and decoder-based models into a single prefix-LM. For learning methods, (i) causal language modeling, (ii) span corruption, and (iii) infilling are unified into a simple learning algorithm. For infill sampling, we explore the claim of a "free lunch" hypothesis. For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored. We conduct a comprehensive series of empirical experiments on 1B LLMs, for which failures and successes of this exploration are distilled into five lessons. We will provide a final recipe for training and release CodeGen2 models in sizes 1B, 3.7B, 7B, and 16B parameters, along with the training framework as open-source: https://github.com/salesforce/CodeGen.

238 Citations

19 Influential

90 Altmetric

726.0 Score

AI Analysis

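Summary

This paper integrates and evaluates four core components (model architecture, learning algorithm, sampling procedure, and data distribution) to make the training of large language models (LLMs) for program synthesis and understanding more efficient. The authors tested the Prefix-LM architecture and the claim that infilling training comes at no cost (the "free lunch" hypothesis), but found that a standard causal decoder trained with a simple mixture of causal language modeling and span corruption was more effective than the more complex architecture. Building on these lessons, they trained on a mix of natural language and programming language data, showed that multi-epoch training is effective, and released the CodeGen2 and CodeGen2.5 model families.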
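The difference between a causal decoder and a Prefix-LM comes down to which positions may attend to which. The following is a minimal sketch of the two attention masks in plain Python; the function names and the `prefix_len` parameter are illustrative assumptions, not taken from the CodeGen2 codebase.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j.

    Causal decoder: each position sees itself and earlier positions only.
    """
    return [[j <= i for j in range(n)] for i in range(n)]


def prefix_lm_mask(n, prefix_len):
    """Prefix-LM: bidirectional attention inside the prefix, causal after it.

    prefix_len is a hypothetical parameter marking how many leading
    positions form the prefix.
    """
    return [[j <= i or j < prefix_len for j in range(n)] for i in range(n)]


def show(mask):
    """Render a mask as rows of 'x' (may attend) and '.' (blocked)."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    print(show(causal_mask(4)))
    print()
    print(show(prefix_lm_mask(4, prefix_len=2)))
```

With `prefix_len=0` the Prefix-LM mask reduces to the causal mask, which is one way to see the causal decoder as the simpler special case.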

Key Innovations

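A simple mixed objective that combines causal language modeling and span corruption
A file-level span corruption strategy that respects file boundaries
A training recipe that mixes natural language (NL) and programming language (PL) data
Evidence that repeated passes over the data (multi-epoch training) improve model performance (CodeGen2.5)
Empirical tests of the Prefix-LM architecture and the infilling "free lunch" hypothesis, including their limitations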
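The mixed objective and file-level span corruption listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the sentinel strings, the mixing probability `p_causal`, the span-length cap `max_span`, and the choice of a single span per file are all hypothetical.

```python
import random

# Hypothetical sentinel tokens, chosen for illustration only.
MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"


def corrupt_span(tokens, start, length):
    """Move tokens[start:start+length] to the end, leaving a sentinel behind.

    The model can then be trained left to right on the whole sequence,
    so recovering the span becomes ordinary next-token prediction.
    """
    span = tokens[start:start + length]
    return (tokens[:start] + [MASK] + tokens[start + length:]
            + [SEP, MASK] + span + [EOM])


def make_example(file_tokens, rng, p_causal=0.5, max_span=3):
    """Build one training sequence from the tokens of a single file.

    Because each call sees exactly one file, a corrupted span can never
    cross a file boundary (the "file-level" part of the strategy).
    """
    if len(file_tokens) < 2 or rng.random() < p_causal:
        return list(file_tokens)  # plain causal language modeling example
    length = rng.randint(1, min(max_span, len(file_tokens) - 1))
    start = rng.randint(0, len(file_tokens) - length)
    return corrupt_span(file_tokens, start, length)


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = "def add ( a , b ) : return a + b".split()
    for _ in range(3):
        print(" ".join(make_example(tokens, rng)))
```

A real pipeline would operate on token IDs and would typically corrupt several spans per sequence; the single-span version keeps the data layout easy to inspect.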

Learning & Inference Impact

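During training, the authors adopted a standard causal decoder instead of the more complex Prefix-LM, which simplified the architecture, and added span corruption to the objective to encourage the model to understand context more deeply. In particular, they showed that performance keeps improving when natural language and code data are mixed and a limited dataset is repeated over multiple epochs, which improves data efficiency. At inference time, this training approach lets a single model handle both ordinary left-to-right code generation and infilling, where the model fills in missing code in the middle of a file.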
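At inference time, infilling with a left-to-right model amounts to rearranging the prompt so that the missing middle is generated last. The sketch below shows one way this could look; the sentinel strings and both helper functions are hypothetical, and the exact prompt format expected by a released checkpoint should be taken from its own documentation.

```python
# Hypothetical sentinel tokens, chosen for illustration only.
MASK, SEP, EOM = "<mask_1>", "<sep>", "<eom>"


def build_infill_prompt(prefix, suffix):
    """Place a sentinel where code is missing, then ask for its content.

    The model is expected to continue after the final sentinel with the
    missing middle, followed by an end-of-mask marker.
    """
    return prefix + MASK + suffix + SEP + MASK


def splice_completion(prefix, suffix, generated):
    """Cut the generation at the end-of-mask marker and rebuild the file."""
    middle = generated.split(EOM, 1)[0]
    return prefix + middle + suffix


if __name__ == "__main__":
    prefix = "def add(a, b):\n    return "
    suffix = "\n"
    prompt = build_infill_prompt(prefix, suffix)
    # Stand-in for a model call; a real model would generate this text.
    fake_generation = "a + b" + EOM + "ignored trailing text"
    print(prompt)
    print(splice_completion(prefix, suffix, fake_generation))
```

Plain left-to-right generation needs no special handling: the prompt is passed as-is, without sentinels.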

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
