2107.03374 Jul 07, 2021 cs.AI

Evaluating Large Language Models Trained on Code

I. Sutskever

Citations: 533,973

h-index: 75

Mark Chen

Citations: 134,179

h-index: 16

Jerry Tworek

Citations: 47,935

h-index: 13

Heewoo Jun

Citations: 51,325

h-index: 15

Qiming Yuan

Citations: 12,338

h-index: 10

Henrique Pondé

Citations: 10,114

h-index: 5

Jared Kaplan

Citations: 30,844

h-index: 33

Harrison Edwards

Citations: 20,232

h-index: 13

Yura Burda

Citations: 13,585

h-index: 2

Nicholas Joseph

Citations: 25,080

h-index: 18

Greg Brockman

Citations: 55,611

h-index: 11

Alex Ray

Citations: 40,988

h-index: 7

Raul Puri

Citations: 14,250

h-index: 12

Gretchen Krueger

Citations: 148,153

h-index: 13

Michael Petrov

Citations: 41,751

h-index: 7

Heidy Khlaaf

Citations: 10,834

h-index: 13

G. Sastry

Citations: 147,155

h-index: 17

Pamela Mishkin

Citations: 114,392

h-index: 17

Brooke Chan

Citations: 37,516

h-index: 8

Scott Gray

Citations: 120,148

h-index: 14

Nick Ryder

Citations: 73,315

h-index: 34

Mikhail Pavlov

Citations: 46,868

h-index: 10

Alethea Power

Citations: 38,675

h-index: 9

Lukasz Kaiser

Citations: 235,445

h-index: 31

Mo Bavarian

Citations: 50,094

h-index: 12

Clemens Winter

Citations: 99,218

h-index: 12

P. Tillet

Citations: 46,002

h-index: 8

F. Such

Citations: 43,526

h-index: 17

D. Cummings

Citations: 10,036

h-index: 1

Matthias Plappert

Citations: 25,763

h-index: 15

Fotios Chantzis

Citations: 10,071

h-index: 4

Elizabeth Barnes

Citations: 13,118

h-index: 4

Ariel Herbert-Voss

Citations: 74,247

h-index: 9

William H. Guss

Citations: 10,788

h-index: 12

Alex Nichol

Citations: 51,520

h-index: 15

Igor Babuschkin

Citations: 36,551

h-index: 8

S. Balaji

Citations: 41,440

h-index: 7

Shantanu Jain

Citations: 11,963

h-index: 4

A. Carr

Citations: 10,070

h-index: 4

Jan Leike

Citations: 75,001

h-index: 30

Josh Achiam

Citations: 10,788

h-index: 5

Vedant Misra

Citations: 33,549

h-index: 13

Evan Morikawa

Citations: 35,295

h-index: 5

Alec Radford

Citations: 284,933

h-index: 33

M. Knight

Citations: 12,161

h-index: 25

Miles Brundage

Citations: 42,748

h-index: 21

M. Murati

Citations: 39,639

h-index: 6

Katie Mayer

Citations: 10,037

h-index: 1

Peter Welinder

Citations: 69,858

h-index: 17

Bob McGrew

Citations: 53,197

h-index: 14

Dario Amodei

Citations: 136,070

h-index: 30

Sam McCandlish

OpenAI

Citations: 96,060

h-index: 30

Wojciech Zaremba

Citations: 98,837

h-index: 31

Abstract

We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

10,101 Citations

1,553 Influential

30 Altmetric

13,357.0 Score

AI Analysis

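Summary

This paper introduces and evaluates Codex, a GPT-3-based code generation model developed by OpenAI. Fine-tuned on public GitHub code, Codex performs strongly at turning natural-language docstrings into Python code. The authors point out that text-similarity metrics such as BLEU do not reflect the functional correctness of code, and they propose HumanEval, a new benchmark that checks generated programs by running unit tests. In the experiments, the 12-billion-parameter Codex far outperforms GPT-3, and multiple sampling (generating several candidates and then selecting among them) raises the solve rate substantially compared with drawing a single sample. The paper also includes a broad analysis of the model's limitations, security risks, and societal impact.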
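To illustrate why text similarity and functional correctness can disagree, here is a minimal sketch. It compares a crude token-overlap score (a simple stand-in used only for illustration, not BLEU itself) with a unit-test check. The task, the reference solution, both candidates, and the helper names `token_overlap` and `passes_tests` are hypothetical and are not taken from HumanEval.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of whitespace-separated tokens (crude stand-in for a text metric)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Hypothetical reference solution for a toy task: "add two numbers".
REFERENCE = "def add(a, b):\n    return a + b\n"

# Textually close to the reference, but wrong (subtracts instead of adds).
CANDIDATE_SIMILAR = "def add(a, b):\n    return a - b\n"

# Textually different from the reference, but functionally correct.
CANDIDATE_DIFFERENT = "def add(x, y):\n    total = sum([x, y])\n    return total\n"


def passes_tests(source: str) -> bool:
    """Run the candidate and check it against two toy unit tests."""
    namespace = {}
    exec(source, namespace)  # NOTE: no sandboxing here; only safe for trusted toy code
    add = namespace["add"]
    return add(1, 2) == 3 and add(-1, 1) == 0


if __name__ == "__main__":
    for name, cand in [("similar", CANDIDATE_SIMILAR), ("different", CANDIDATE_DIFFERENT)]:
        print(name, round(token_overlap(REFERENCE, cand), 3), passes_tests(cand))
```

In this toy case the wrong candidate scores higher on token overlap than the correct one, while only the correct one passes the tests. That is the kind of mismatch an execution-based benchmark is meant to avoid.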

Key Innovations

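Release of HumanEval, a new benchmark dataset for evaluating code generation ability
Introduction of the pass@k evaluation metric, which is based on passing unit tests rather than simple text matching, together with an unbiased estimator for it
Development of Codex-S, a model obtained by supervised fine-tuning on standalone functions
An extended tokenizer that accounts for whitespace in code and represents code with roughly 30% fewer tokens
A sandbox environment for safely executing and evaluating generated code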
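The pass@k estimator mentioned above can be sketched as follows. This page only names the metric, so the formula used here is an assumption based on the combinatorial form commonly attributed to the paper: with n samples per problem, of which c pass the unit tests, pass@k is estimated as 1 - C(n-c, k) / C(n, k). The function name and argument checks are ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of samples the metric allows (k <= n)

    Returns 1 - C(n - c, k) / C(n, k), i.e. one minus the probability that
    a random size-k subset of the n samples contains no passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so every size-k subset then
    # contains at least one passing sample and the estimate is 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Example: 10 samples, 3 of which pass.
    print(pass_at_k(10, 3, 1))
    print(pass_at_k(10, 3, 5))
```

A benchmark-level score would then be the mean of this per-problem estimate over all problems. For k = 1 the estimate reduces to c / n, the fraction of passing samples.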

Learning & Inference Impact

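On the training side, the model starts from GPT-3, a natural-language model, and is fine-tuned on a large corpus of GitHub code so that it learns the structural properties of programming languages. In particular, extra tokens for whitespace were added to improve training and inference efficiency. On the inference side, the paper demonstrates the effectiveness of repeated sampling: rather than generating a single output, generating 100 samples at a higher temperature and looking for one that passes the unit tests raises the pass rate from 28.8% to 70.2%. This points to a practical way of obtaining correct code at the price of higher inference cost.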
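The repeated-sampling strategy described above can be sketched roughly as follows. No real model is called: `sample_completion` is a hypothetical stand-in that draws from a fixed pool of candidate programs, and the task, the pool, and the unit tests are invented for illustration. The `exec` call is not a sandbox, so this is only suitable for trusted toy code.

```python
import random
from typing import Optional

# Hypothetical pool of "model outputs" for the prompt
# "return the absolute value of x". Only the last one is correct.
CANDIDATE_POOL = [
    "def absolute(x):\n    return x\n",
    "def absolute(x):\n    return -x\n",
    "def absolute(x):\n    return x if x >= 0 else -x\n",
]


def sample_completion(rng: random.Random) -> str:
    """Stand-in for drawing one completion from a model at nonzero temperature."""
    return rng.choice(CANDIDATE_POOL)


def passes_unit_tests(source: str) -> bool:
    """Execute a candidate and check it against toy unit tests."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # NOTE: not a sandbox; trusted toy code only
        f = namespace["absolute"]
        return f(3) == 3 and f(-4) == 4 and f(0) == 0
    except Exception:
        # Any crash or missing function counts as a failed candidate.
        return False


def solve_with_repeated_sampling(num_samples: int, seed: int = 0) -> Optional[str]:
    """Draw up to num_samples candidates; return the first that passes, else None."""
    rng = random.Random(seed)
    for _ in range(num_samples):
        candidate = sample_completion(rng)
        if passes_unit_tests(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(solve_with_repeated_sampling(100))
```

The trade-off is the one the summary notes: each extra sample costs another generation and another test run, but it raises the chance that at least one candidate is correct.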

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.