2501.12948 Jan 22, 2025 cs.AI

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

DeepSeek-AI (Citations: 8,627; h-index: 4)
Daya Guo (Citations: 26,135; h-index: 25)
Dejian Yang (Citations: 11,133; h-index: 11)
Haowei Zhang (Citations: 8,985; h-index: 8)
Jun-Mei Song (Citations: 14,450; h-index: 10)
Ruoyu Zhang (Citations: 8,276; h-index: 6)
R. Xu (Citations: 14,278; h-index: 6)
Qihao Zhu (Citations: 16,093; h-index: 11)
Shirong Ma (Citations: 9,725; h-index: 10)
Peiyi Wang (Citations: 8,437; h-index: 6)
Xiaoling Bi (Citations: 8,245; h-index: 4)
Xiaokang Zhang (Citations: 8,279; h-index: 6)
Xingkai Yu (Citations: 10,388; h-index: 10)
Yu Wu (Citations: 8,721; h-index: 5)
Z. F. Wu (Citations: 8,546; h-index: 7)
Zhibin Gou, Tsinghua University (Citations: 10,696; h-index: 17)
Zhihong Shao (Citations: 17,328; h-index: 14)
Zhuoshu Li (Citations: 8,958; h-index: 6)
Ziyi Gao (Citations: 8,256; h-index: 5)
A. Liu (Citations: 9,821; h-index: 8)
Bing Xue (Citations: 8,260; h-index: 5)
Bing-Li Wang (Citations: 10,519; h-index: 9)
Bochao Wu (Citations: 8,258; h-index: 5)
B. Feng (Citations: 8,243; h-index: 4)
Chengda Lu (Citations: 8,310; h-index: 7)
Chenggang Zhao (Citations: 10,249; h-index: 12)
C. Deng (Citations: 10,818; h-index: 11)
Chenyu Zhang (Citations: 8,042; h-index: 3)
C. Ruan (Citations: 12,769; h-index: 14)
Damai Dai (Citations: 9,596; h-index: 11)
Deli Chen (Citations: 10,752; h-index: 8)
Dong-Li Ji (Citations: 8,246; h-index: 4)
Erhang Li (Citations: 8,919; h-index: 5)
Fangyun Lin (Citations: 8,923; h-index: 5)
Fucong Dai (Citations: 8,243; h-index: 4)
Fuli Luo (Citations: 11,446; h-index: 10)
Guangbo Hao (Citations: 8,935; h-index: 6)
Guanting Chen (Citations: 10,407; h-index: 7)
Guowei Li (Citations: 8,932; h-index: 6)
H. Zhang (Citations: 8,242; h-index: 4)
Han Bao (Citations: 8,055; h-index: 4)
Hanwei Xu (Citations: 10,053; h-index: 9)
Haocheng Wang (Citations: 8,387; h-index: 8)
Honghui Ding (Citations: 8,940; h-index: 6)
Huajian Xin (Citations: 9,809; h-index: 11)
Huazuo Gao (Citations: 10,946; h-index: 12)
Hui Qu (Citations: 8,029; h-index: 3)
Hui Li (Citations: 8,257; h-index: 5)
Jianzhong Guo (Citations: 8,946; h-index: 6)
Jiashi Li (Citations: 10,103; h-index: 10)
Jiawei Wang (Citations: 5,360; h-index: 3)
JingChang Chen (Citations: 8,248; h-index: 4)
Jingyang Yuan (Citations: 8,540; h-index: 7)
Junjie Qiu (Citations: 8,941; h-index: 6)
Junlong Li (Citations: 8,274; h-index: 5)
J. Cai (Citations: 5,349; h-index: 1)
J. Ni (Citations: 8,244; h-index: 4)
Jian Liang (Citations: 8,245; h-index: 4)
Jin Chen (Citations: 8,258; h-index: 5)
Kai Dong (Citations: 11,576; h-index: 10)
Kai Hu (Citations: 8,545; h-index: 6)
Kaige Gao (Citations: 8,920; h-index: 5)
Kang Guan (Citations: 9,815; h-index: 8)
Kexin Huang (Citations: 8,281; h-index: 8)
K. Yu (Citations: 8,250; h-index: 5)
Lean Wang (Citations: 8,537; h-index: 6)
Lecong Zhang (Citations: 8,918; h-index: 5)
Liang Zhao (Citations: 9,100; h-index: 8)
Litong Wang (Citations: 8,280; h-index: 6)
Liyue Zhang (Citations: 9,683; h-index: 10)
Lei Xu (Citations: 8,313; h-index: 6)
Leyi Xia (Citations: 8,242; h-index: 4)
Mingchuan Zhang (Citations: 13,757; h-index: 7)
Minghua Zhang (Citations: 8,246; h-index: 4)
M. Tang (Citations: 5,399; h-index: 4)
Meng Li (Citations: 8,244; h-index: 4)
Miaojun Wang (Citations: 8,247; h-index: 4)
Mingming Li (Citations: 8,251; h-index: 5)
Ning Tian (Citations: 8,266; h-index: 6)
Panpan Huang (Citations: 9,503; h-index: 8)
Peng Zhang (Citations: 8,268; h-index: 6)
Qiancheng Wang (Citations: 8,350; h-index: 6)
Qinyu Chen (Citations: 8,658; h-index: 6)
Qiushi Du (Citations: 9,496; h-index: 9)
Ruiqi Ge (Citations: 8,945; h-index: 6)
Ruisong Zhang (Citations: 8,242; h-index: 4)
Ruizhe Pan (Citations: 8,317; h-index: 5)
Runji Wang (Citations: 8,242; h-index: 4)
R. J. Chen (Citations: 8,242; h-index: 4)
R. Jin (Citations: 8,246; h-index: 4)
Ruyi Chen (Citations: 8,270; h-index: 6)
Shanghao Lu (Citations: 8,945; h-index: 7)
Shangyan Zhou (Citations: 9,007; h-index: 8)
Shanhuang Chen (Citations: 8,939; h-index: 6)
Shengfeng Ye (Citations: 8,070; h-index: 4)
Shiyu Wang (Citations: 8,236; h-index: 4)
Shuiping Yu (Citations: 8,965; h-index: 7)
Shunfeng Zhou (Citations: 8,926; h-index: 5)
Shuting Pan (Citations: 8,230; h-index: 3)
S. Li (Citations: 8,230; h-index: 3)
Shuang Zhou (Citations: 8,298; h-index: 6)
Shao-Kang Wu (Citations: 5,569; h-index: 3)
Tao Yun (Citations: 8,230; h-index: 3)
Tian Pei (Citations: 8,967; h-index: 7)
T. Sun (Citations: 8,243; h-index: 4)
T. Wang (Citations: 8,027; h-index: 2)
Wangding Zeng (Citations: 9,637; h-index: 9)
Wanjia Zhao, Stanford University (Citations: 8,414; h-index: 8)
Wen Liu (Citations: 9,688; h-index: 8)
W. Liang (Citations: 11,855; h-index: 12)
Wenjun Gao (Citations: 9,644; h-index: 10)
Wen-Xia Yu (Citations: 5,349; h-index: 1)
Wentao Zhang (Citations: 5,608; h-index: 4)
W. Xiao (Citations: 8,201; h-index: 5)
Wei An (Citations: 8,263; h-index: 5)
Xiaodong Liu (Citations: 8,974; h-index: 7)
Xiaohan Wang (Citations: 8,249; h-index: 5)
Xiaokang Chen (Citations: 9,703; h-index: 9)
X. Nie (Citations: 8,950; h-index: 7)
Xin Cheng (Citations: 9,434; h-index: 11)
Xin Liu (Citations: 8,078; h-index: 5)
Xin Xie (Citations: 9,816; h-index: 8)
Xingchao Liu (Citations: 9,758; h-index: 9)
Xinyu Yang (Citations: 8,244; h-index: 4)
Xinyuan Li (Citations: 8,242; h-index: 4)
Xuecheng Su (Citations: 8,944; h-index: 6)
Xuheng Lin (Citations: 8,242; h-index: 4)
X. Q. Li (Citations: 8,230; h-index: 3)
Xiangyu Jin (Citations: 8,379; h-index: 6)
Xi-Cheng Shen (Citations: 8,061; h-index: 4)
Xiaosha Chen (Citations: 8,329; h-index: 5)
Xiaowen Sun (Citations: 8,250; h-index: 4)
Xiaoxiang Wang (Citations: 8,026; h-index: 2)
Xinnan Song (Citations: 8,230; h-index: 3)
Xinyi Zhou (Citations: 8,236; h-index: 3)
Xianzu Wang (Citations: 8,242; h-index: 4)
Xinxia Shan (Citations: 8,027; h-index: 2)
Y. K. Li (Citations: 15,701; h-index: 7)
Y. Q. Wang (Citations: 8,230; h-index: 3)
Y. X. Wei (Citations: 8,300; h-index: 3)
Yang Zhang (Citations: 8,116; h-index: 4)
Yanhong Xu (Citations: 8,927; h-index: 5)
Yao Li (Citations: 17,134; h-index: 7)
Yao Zhao (Citations: 8,930; h-index: 5)
Yaofeng Sun (Citations: 10,176; h-index: 8)
Yaohui Wang (Citations: 9,388; h-index: 7)
Yi Yu (Citations: 8,025; h-index: 2)
Yichao Zhang (Citations: 8,263; h-index: 5)
Yifan Shi (Citations: 8,230; h-index: 3)
Yi Xiong (Citations: 8,277; h-index: 5)
Ying He (Citations: 8,322; h-index: 6)
Y. Piao (Citations: 9,809; h-index: 7)
Yisong Wang (Citations: 8,833; h-index: 6)
Yixuan Tan (Citations: 8,268; h-index: 5)
Yiyang Ma (Citations: 5,555; h-index: 2)
Yiyuan Liu (Citations: 8,926; h-index: 5)
Yongqiang Guo (Citations: 8,274; h-index: 5)
Y. Ou (Citations: 8,230; h-index: 3)
Yuduan Wang (Citations: 8,231; h-index: 3)
Yue Gong (Citations: 8,326; h-index: 6)
Yu-Jing Zou (Citations: 5,353; h-index: 2)
Yujia He (Citations: 8,028; h-index: 2)
Yunfan Xiong (Citations: 8,300; h-index: 6)
Yu-Wei Luo (Citations: 8,256; h-index: 5)
Yu-mei You (Citations: 9,779; h-index: 6)
Yuxuan Liu (Citations: 8,441; h-index: 6)
Yuyang Zhou (Citations: 8,231; h-index: 3)
Y. X. Zhu (Citations: 8,230; h-index: 3)
Yanping Huang (Citations: 8,304; h-index: 5)
Yi Zheng (Citations: 8,030; h-index: 3)
Yuchen Zhu (Citations: 8,233; h-index: 4)
Yunxiang Ma (Citations: 8,261; h-index: 4)
Ying Tang (Citations: 8,231; h-index: 3)
Y. Zha (Citations: 8,229; h-index: 3)
Yuting Yan (Citations: 8,151; h-index: 5)
Z. Ren (Citations: 17,190; h-index: 10)
Zhangli Sha (Citations: 8,925; h-index: 5)
Zhe Fu (Citations: 9,087; h-index: 7)
Zhean Xu (Citations: 8,249; h-index: 4)
Zhenda Xie (Citations: 14,098; h-index: 15)
Zhen-guo Zhang (Citations: 8,240; h-index: 4)
Zhewen Hao (Citations: 9,322; h-index: 6)
Zhicheng Ma (Citations: 8,230; h-index: 3)
Zhigang Yan (Citations: 8,227; h-index: 3)
Zhiyu Wu (Citations: 9,690; h-index: 8)
Zihui Gu (Citations: 8,656; h-index: 5)
Zijia Zhu (Citations: 8,227; h-index: 3)
Zijun Liu, Tsinghua University (Citations: 8,699; h-index: 9)
Zi-An Li (Citations: 8,033; h-index: 3)
Ziwei Xie (Citations: 8,925; h-index: 5)
Ziyang Song (Citations: 8,040; h-index: 3)
Zizheng Pan (Citations: 9,668; h-index: 7)
Zhen Huang (Citations: 8,287; h-index: 6)
Zhipeng Xu (Citations: 8,227; h-index: 3)
Zhongyu Zhang (Citations: 8,253; h-index: 5)
Zhen Zhang (Citations: 8,244; h-index: 4)

Original Abstract

General reasoning represents a long-standing and formidable challenge in artificial intelligence. Recent breakthroughs, exemplified by large language models (LLMs) and chain-of-thought prompting, have achieved considerable success on foundational reasoning tasks. However, this success is heavily contingent upon extensive human-annotated demonstrations, and models' capabilities are still insufficient for more complex problems. Here we show that the reasoning abilities of LLMs can be incentivized through pure reinforcement learning (RL), obviating the need for human-labeled reasoning trajectories. The proposed RL framework facilitates the emergent development of advanced reasoning patterns, such as self-reflection, verification, and dynamic strategy adaptation. Consequently, the trained model achieves superior performance on verifiable tasks such as mathematics, coding competitions, and STEM fields, surpassing its counterparts trained via conventional supervised learning on human demonstrations. Moreover, the emergent reasoning patterns exhibited by these large-scale models can be systematically harnessed to guide and enhance the reasoning capabilities of smaller models.

5336 Citations
1001 Influential
12.5 Altmetric
7,400.5 Score

AI Analysis

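Summary

This paper introduces two models aimed at strengthening the reasoning ability of large language models (LLMs): DeepSeek-R1-Zero, trained with pure reinforcement learning (RL), and DeepSeek-R1, which builds on it. The authors show that RL alone, without supervised fine-tuning (SFT), lets a model learn complex reasoning patterns such as self-reflection and verification on its own. Because the initial R1-Zero model suffered from poor readability and language mixing, they add a "cold start" stage that uses a small amount of high-quality data, followed by a multi-stage training pipeline, to produce the final DeepSeek-R1. The resulting model performs on par with OpenAI's o1-1217 on complex reasoning tasks such as mathematics and coding, and the authors show that its reasoning ability can be distilled effectively into smaller models.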

Key Innovations

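  • GRPO (Group Relative Policy Optimization): an RL algorithm that removes the value model required by PPO, which improves memory efficiency and simplifies training
  • Reasoning that emerges from pure RL: with rule-based rewards only and no human-labeled reasoning traces, long chain-of-thought and self-correction behavior appear on their own
  • Multi-stage training pipeline: combines cold-start SFT, reasoning-oriented RL, general SFT, and RL for all scenarios to secure both performance and user-friendliness
  • Distillation: reasoning outputs from DeepSeek-R1 are used as training data to substantially improve the performance of smaller models (1.5B to 70B)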
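To make the "rule-based rewards" bullet concrete, here is a minimal sketch of what such a reward could look like. It assumes completions wrap their reasoning in `<think>` tags and their final answer in `<answer>` tags, and it scores an exact string match; the function names, the `format_weight` value, and the exact-match check are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumed output template: reasoning inside <think>, final answer inside <answer>.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(completion: str) -> float:
    """Return 1.0 if the completion follows the assumed tag template, else 0.0."""
    return 1.0 if THINK_ANSWER.match(completion.strip()) else 0.0


def accuracy_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference string exactly."""
    match = THINK_ANSWER.match(completion.strip())
    if match is None:
        return 0.0
    return 1.0 if match.group(2).strip() == reference.strip() else 0.0


def rule_based_reward(completion: str, reference: str,
                      format_weight: float = 0.5) -> float:
    """Combine both signals; format_weight is a hypothetical hyperparameter."""
    return (accuracy_reward(completion, reference)
            + format_weight * format_reward(completion))


if __name__ == "__main__":
    sample = "<think>2 + 2 equals 4.</think> <answer>4</answer>"
    print(rule_based_reward(sample, "4"))
```

Because both checks are deterministic rules rather than a learned reward model, a setup like this needs no human-labeled reasoning traces, only a reference answer per problem. A real system for math or code would likely replace the exact-match check with an answer normalizer or a test runner.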

Learning & Inference Impact

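During training, GRPO computes relative advantages within a group of sampled outputs instead of relying on a critic model, which reduces the cost of RL for large models. At inference (test time), the model adjusts the length of its thinking tokens to the difficulty of the problem and generates intermediate reasoning that can run to thousands of tokens. This scales test-time compute: the model checks its own work for errors and revises its strategy before answering, which substantially improves its ability to solve complex problems.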
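The group-relative advantage mentioned above can be sketched in a few lines. For a group of rewards sampled for the same prompt, each advantage is the reward standardized against the group, A_i = (r_i - mean(r)) / std(r), so no critic network is needed. The function name, the `eps` guard, and the use of the population standard deviation are assumptions made for this illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float],
                              eps: float = 1e-8) -> list[float]:
    """Standardize each reward against its own sampling group.

    eps (an assumed safeguard) avoids division by zero when every
    sample in the group received the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; the choice is an assumption
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt: two correct, two incorrect.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct samples receive a positive advantage and incorrect ones a negative advantage of equal magnitude, and a group whose samples all score the same contributes zero signal. The baseline comes from the group statistics alone, which is where the memory saving over a PPO-style value model comes from.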

Technical Difficulty

Advanced

Estimated implementation complexity based on methodology.