2203.02155 Mar 04, 2022 cs.AI

Training language models to follow instructions with human feedback

Alex Ray (Citations: 40,988; h-index: 7)
Pamela Mishkin (Citations: 114,392; h-index: 17)
Jan Leike (Citations: 75,001; h-index: 30)
Peter Welinder (Citations: 69,858; h-index: 17)
Yujia Liu (Citations: 0; h-index: 0)
Amanda Askell (Citations: 146,014; h-index: 18)
Jacob Hilton (Citations: 40,542; h-index: 9)
Jeff Wu (Citations: 129,170; h-index: 11)
Xu Jiang (Citations: 23,466; h-index: 2)
John Schulman (Citations: 137,700; h-index: 45)
S. Agarwal (Citations: 162,519; h-index: 20)
M. Simens (Citations: 51,309; h-index: 9)
Katarina Slama (Citations: 47,476; h-index: 11)
Long Ouyang (Citations: 25,242; h-index: 6)
Diogo Almeida (Citations: 23,258; h-index: 5)
Carroll L. Wainwright (Citations: 21,610; h-index: 2)
Fraser Kelton (Citations: 21,726; h-index: 7)
Luke E. Miller (Citations: 21,621; h-index: 4)
P. Christiano (Citations: 35,181; h-index: 16)
Ryan J. Lowe (Citations: 36,809; h-index: 50)

Abstract

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

21719 Citations

2277 Influential

25 Altmetric

26,398.0 Score


AI Analysis

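Summary

This paper proposes InstructGPT, which applies reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with user intent. GPT-3, trained only to predict the next token, often generates harmful or untruthful output. To address this, the authors use a three-step process: (1) supervised fine-tuning (SFT) on human-written demonstrations, (2) training a reward model (RM) on human rankings of model outputs, and (3) reinforcement learning with the PPO algorithm against that reward model. As a result, the 1.3B-parameter InstructGPT was preferred over the 175B-parameter GPT-3 in human evaluations, with improved truthfulness and reduced toxicity.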
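The reward model step can be made concrete with a small sketch. The summary does not state which loss is used on the ranking data, so the code below assumes a pairwise logistic loss, a common choice for learning from rankings: every pair of responses to the same prompt contributes -log(sigmoid(r_preferred - r_rejected)). The function name `pairwise_ranking_loss` and the plain-float inputs are illustrative, not taken from the paper's code.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(rewards_in_rank_order):
    """Mean pairwise logistic loss for the responses to one prompt.

    rewards_in_rank_order: scalar reward-model scores for K responses,
    listed from most preferred to least preferred by the labeler.
    (Hypothetical interface, assumed for illustration.)
    """
    # combinations() keeps list order, so each pair is (preferred, rejected).
    pairs = list(combinations(rewards_in_rank_order, 2))
    if not pairs:
        raise ValueError("need at least two ranked responses")

    total = 0.0
    for r_preferred, r_rejected in pairs:
        d = r_preferred - r_rejected
        # -log(sigmoid(d)) == log(1 + exp(-d)), computed in a stable form.
        total += max(-d, 0.0) + math.log1p(math.exp(-abs(d)))
    return total / len(pairs)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
    print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
```

In a real training loop the scores would come from a neural reward model and the loss would be differentiated with an autodiff library; plain floats are used here only to keep the sketch self-contained.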

Key Innovations

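Systematic application of reinforcement learning from human feedback (RLHF) to tuning large language models
A three-step alignment pipeline: supervised fine-tuning (SFT), reward model (RM) training, then PPO reinforcement learning
Pretraining data mixing (PPO-ptx) to mitigate the loss of general NLP performance caused by reinforcement learning (the "alignment tax")
Evidence that alignment with human feedback improves user preference far more efficiently than a 100x increase in model size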
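The idea behind mixing pretraining data into the RL step can be sketched as a single scalar objective. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the RL term is the reward minus a per-sample KL-style penalty against the SFT model, and that the pretraining term is an average log-likelihood weighted by a coefficient. The names `ppo_ptx_objective`, `beta`, and `gamma` are hypothetical, and the PPO clipped update itself is omitted.

```python
def ppo_ptx_objective(rl_samples, pretrain_logps, beta, gamma):
    """Quantity to maximize when mixing pretraining data into RL fine-tuning.

    rl_samples: list of (reward, logp_policy, logp_sft) tuples, one per
        sampled response; the log-probabilities are of that response under
        the current policy and under the frozen SFT model.
    pretrain_logps: log-likelihoods of pretraining texts under the policy.
    beta: weight of the penalty for drifting away from the SFT model.
    gamma: weight of the pretraining term (0 gives the unmixed objective).
    All names and the exact form are assumptions made for illustration.
    """
    if not rl_samples or not pretrain_logps:
        raise ValueError("both batches must be non-empty")

    # Reward minus a penalty that grows as the policy leaves the SFT model.
    rl_term = sum(
        reward - beta * (logp_policy - logp_sft)
        for reward, logp_policy, logp_sft in rl_samples
    ) / len(rl_samples)

    # Keeping pretraining text likely counteracts regressions on general NLP tasks.
    ptx_term = sum(pretrain_logps) / len(pretrain_logps)

    return rl_term + gamma * ptx_term


if __name__ == "__main__":
    batch = [(1.0, -1.0, -2.0), (0.5, -3.0, -3.0)]
    print(ppo_ptx_objective(batch, [-3.0, -2.5], beta=0.1, gamma=0.5))
```

Setting `gamma` to zero recovers the objective without pretraining mixing, which makes the trade-off explicit: a larger `gamma` favors preserving general language-modeling ability over maximizing the learned reward.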

Learning & Inference Impact

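Training requires high-quality datasets written or rated by human labelers, and the added reward model training and PPO optimization steps make the training pipeline more complex. However, fine-tuning is relatively cheap compared with pretraining a large model from scratch. At inference time the model architecture is the same as GPT-3, so compute cost is unchanged, but because the model understands and follows instructions directly, users carry much less of the prompt-engineering burden of supplying elaborate few-shot examples.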
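The difference in prompting burden can be illustrated with two small helpers. Both functions and the "Input:/Output:" layout are hypothetical conventions chosen for this sketch; they are not a format prescribed by the paper or by any API.

```python
def few_shot_prompt(examples, query):
    """Build a prompt that teaches the task through worked examples.

    examples: list of (input_text, output_text) pairs.
    The "Input:/Output:" layout is an assumed convention for illustration.
    """
    blocks = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    blocks.append(f"Input: {query}\nOutput:")
    return "\n\n".join(blocks)


def instruction_prompt(instruction, query):
    """Build a prompt that states the task directly, with no examples."""
    return f"{instruction}\n\n{query}"


if __name__ == "__main__":
    examples = [("cheese", "fromage"), ("apple", "pomme")]
    print(few_shot_prompt(examples, "bread"))
    print("---")
    print(instruction_prompt("Translate the following word to French.", "bread"))
```

With a base model, the user typically has to curate the example pairs for each new task; with an instruction-following model, a one-line instruction is often enough.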

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
