2501.12599 Jan 22, 2025 cs.AI

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Zhilin Yang

Citations: 35,150

h-index: 42

Kimi Team

Citations: 959

h-index: 1

Angang Du

Citations: 1,464

h-index: 3

Bofei Gao

Citations: 1,456

h-index: 6

Bowei Xing

Citations: 1,242

h-index: 2

Cheng Chen

Citations: 1,553

h-index: 8

Cheng Li

Citations: 1,480

h-index: 5

Chenjun Xiao

Citations: 1,626

h-index: 5

Chenzhuang Du

Citations: 2,457

h-index: 8

Chonghua Liao

Citations: 961

h-index: 2

Chuning Tang

Citations: 962

h-index: 2

Congcong Wang

Citations: 1,451

h-index: 3

Dehao Zhang

Citations: 2,331

h-index: 8

Enzhe Lu

Citations: 2,277

h-index: 8

Feng Tang

Citations: 959

h-index: 1

Flood Sung

Citations: 2,069

h-index: 10

Guangda Wei

Citations: 1,238

h-index: 2

Guokun Lai

Citations: 2,280

h-index: 8

Haiqing Guo

Citations: 1,297

h-index: 4

Han Zhu

Citations: 1,404

h-index: 4

Haochen Ding

Citations: 1,536

h-index: 5

Hao-Xing Hu

Citations: 1,783

h-index: 6

Haoming Yang

Citations: 962

h-index: 2

Hao Zhang

Citations: 1,473

h-index: 4

Haotian Yao

Citations: 1,839

h-index: 5

Hao-Dong Zhao

Citations: 1,261

h-index: 3

Haoyu Lu

Citations: 1,492

h-index: 5

Hongcheng Gao

Tsinghua University

Citations: 2,823

h-index: 20

Huabin Zheng

Citations: 2,173

h-index: 6

Huan Yuan

Citations: 974

h-index: 2

Jia Chen

Citations: 1,185

h-index: 4

Jianling Su

Citations: 1,773

h-index: 7

Jianzhou Wang

Citations: 1,808

h-index: 5

Jie Zhao

Citations: 976

h-index: 2

Jin Zhang

Citations: 961

h-index: 1

Jingyuan Liu

Citations: 964

h-index: 2

Junjie Yan

Citations: 2,333

h-index: 9

Junyan Wu

Citations: 1,453

h-index: 3

Li-Na Shi

Citations: 1,288

h-index: 3

Li-tao Ye

Citations: 970

h-index: 2

Long Yu

Citations: 2,475

h-index: 9

Meng-xiao Dong

Citations: 1,337

h-index: 3

Neo Y. Zhang

Citations: 1,120

h-index: 2

Qi Pan

Citations: 990

h-index: 2

Shaowei Liu

Citations: 2,783

h-index: 10

Shen Ma

Citations: 1,315

h-index: 4

Shu-Yan Wei

Citations: 982

h-index: 3

S. Cao

Citations: 1,259

h-index: 4

Si-Da Huang

Citations: 997

h-index: 2

Tao Jiang

Citations: 1,510

h-index: 5

Wei-Wei Gao

Citations: 977

h-index: 2

Weiming Xiong

Citations: 1,499

h-index: 4

Weiran He

Citations: 2,874

h-index: 12

Weixiao Huang

Citations: 1,885

h-index: 8

Wenhao Wu

Citations: 1,555

h-index: 5

Wen He

Citations: 1,301

h-index: 4

Xian-sen Wei

Citations: 963

h-index: 2

Xian-Xian Jia

Citations: 971

h-index: 2

Xingzhe Wu

Citations: 1,547

h-index: 5

Xinran Xu

Citations: 2,594

h-index: 11

Xinxing Zu

Citations: 1,773

h-index: 6

Xinyu Zhou

Citations: 1,016

h-index: 4

Xue-biao Pan

Citations: 974

h-index: 2

Y. Charles

Citations: 1,708

h-index: 7

Yang Li

Citations: 1,466

h-index: 4

Yan-Ling Hu

Citations: 962

h-index: 2

Yangyang Liu

Citations: 1,678

h-index: 5

Yanru Chen

Citations: 1,563

h-index: 6

Ye-Jia Wang

Citations: 1,322

h-index: 3

Yibo Liu

Citations: 2,091

h-index: 6

Yidao Qin

Citations: 1,501

h-index: 3

Yifeng Liu

Citations: 1,014

h-index: 3

Yingbo Yang

Citations: 970

h-index: 2

Yiping Bao

Citations: 1,776

h-index: 6

Yulun Du

Citations: 2,355

h-index: 10

Yuxin Wu

Citations: 1,975

h-index: 9

Yuzhi Wang

Citations: 2,484

h-index: 9

Zaida Zhou

Citations: 2,519

h-index: 9

Zhaoji Wang

Citations: 1,546

h-index: 6

Zhaowei Li

Citations: 2,019

h-index: 9

Zhengxin Zhu

Citations: 1,485

h-index: 5

Zheng Zhang

Citations: 2,072

h-index: 7

Zhexu Wang

Citations: 1,545

h-index: 7

Zhiqi Huang

Peking University

Citations: 2,158

h-index: 15

Zihao Huang

Citations: 1,565

h-index: 6

Ziya Xu

Citations: 959

h-index: 1

Zonghan Yang

Citations: 1,487

h-index: 4

Changjiu Jiang

Citations: 963

h-index: 2

Haoze Li

Citations: 990

h-index: 3

Hao Yu

Citations: 980

h-index: 2

Ning Ma

Citations: 986

h-index: 3

Qucheng Gong

Citations: 1,559

h-index: 8

Enming Yuan

Citations: 2,437

h-index: 12

Jianhang Guo

Citations: 1,251

h-index: 2

Abstract

Language model pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that large language models (LLMs) can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

968 Citations

60 Influential

21 Altmetric

1,193.0 Score

AI Analysis

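Summary

This paper is the technical report for Kimi k1.5, a recent multimodal LLM whose reasoning ability is scaled through reinforcement learning (RL). The authors show that long-context scaling and an improved policy optimization algorithm are enough, without complex techniques such as Monte Carlo tree search (MCTS) or value functions, for the model to plan, reflect, and correct its own errors. They also present the 'Long2Short' methodology, which transfers the performance gained from long chain-of-thought (long-CoT) reasoning to short-CoT models and so improves inference efficiency. As a result, Kimi k1.5 matches OpenAI's o1 on benchmarks such as AIME, MATH, and Codeforces.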

Key Innovations

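RL scaling through long-context extension (up to 128k tokens)
A simplified RL framework that relies on policy optimization alone, without MCTS or process reward models
Partial rollouts, an infrastructure technique for processing long reasoning trajectories efficiently
Long2Short techniques that transfer the reasoning ability of long-CoT models to short-CoT models (model merging, shortest rejection sampling, and others)
A variant of online mirror descent that incorporates negative gradients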
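
One of the Long2Short techniques named above, shortest rejection sampling, can be illustrated with a minimal Python sketch. It assumes that the method samples several responses to the same question and keeps the shortest one that is judged correct. The function name `shortest_correct`, the `is_correct` checker, and the use of character count as the length measure are hypothetical choices for illustration, not details taken from the report.

```python
from typing import Callable, List, Optional


def shortest_correct(
    candidates: List[str],
    is_correct: Callable[[str], bool],
    length: Callable[[str], int] = len,
) -> Optional[str]:
    """Return the shortest candidate that passes the correctness check.

    Returns None when no candidate is correct, so the caller can skip
    the question instead of training on a wrong answer.
    """
    correct = [c for c in candidates if is_correct(c)]
    if not correct:
        return None
    return min(correct, key=length)


# Toy usage: three sampled responses to "What is 2 + 2?".
samples = [
    "Let me think step by step. 2 + 2 = 4. Double-checking: 4. Answer: 4",
    "2 + 2 = 4. Answer: 4",
    "Answer: 5",  # shortest overall, but wrong, so it is rejected
]
best = shortest_correct(samples, lambda s: s.endswith("Answer: 4"))
print(best)
```

In practice the length would more likely be measured in tokens than in characters, and the selected responses would then serve as fine-tuning data for the short-CoT model. Both points are assumptions about how such a pipeline is typically wired.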

Learning & Inference Impact

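During training, a partial rollout system splits the generation of long sequences into segments, which works around memory constraints and improves training efficiency, while curriculum and prioritized sampling improve data efficiency. The model is also penalized for incorrect answers (negative gradients) rather than only rewarded for correct ones, which speeds up learning. At inference time, long-CoT enables deeper reasoning on complex problems, and the Long2Short techniques compress that reasoning so that a deployed model can reason well at a lower token cost.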
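
The partial rollout idea described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the system from the report: each training iteration gives every unfinished response a fixed token budget, and responses that do not finish are returned to a queue and resumed in the next iteration. The names `Rollout`, `run_iteration`, `toy_step`, and the `budget` parameter are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List

EOS = "<eos>"


@dataclass
class Rollout:
    prompt: str
    tokens: List[str] = field(default_factory=list)
    done: bool = False


def run_iteration(
    pending: Deque[Rollout],
    step_fn: Callable[[Rollout], str],
    budget: int,
) -> List[Rollout]:
    """Advance every pending rollout by at most `budget` tokens.

    Finished rollouts are returned; unfinished ones go back into
    `pending` with their partial tokens kept, so the next iteration
    resumes them instead of regenerating from scratch.
    """
    finished = []
    for _ in range(len(pending)):
        rollout = pending.popleft()
        for _ in range(budget):
            token = step_fn(rollout)
            rollout.tokens.append(token)
            if token == EOS:
                rollout.done = True
                break
        if rollout.done:
            finished.append(rollout)
        else:
            pending.append(rollout)
    return finished


def toy_step(rollout: Rollout) -> str:
    # Stand-in for a model call: emit as many tokens as the prompt has
    # characters, then stop.
    if len(rollout.tokens) >= len(rollout.prompt):
        return EOS
    return "tok"


pending = deque([Rollout("ab"), Rollout("abcdefg")])
first = run_iteration(pending, toy_step, budget=4)   # short rollout finishes
second = run_iteration(pending, toy_step, budget=4)  # long rollout resumes
print(len(first), len(second), len(pending))
```

Capping the per-iteration budget means one very long response cannot stall a whole batch. The cost, in a real system, is that the resumed segments were produced by an older version of the policy.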

Technical Difficulty

Advanced

Estimated implementation complexity based on methodology.
