2601.21343v2 Jan 29, 2026 cs.CL

Self-Improving Pretraining: using post-trained models to pretrain better models

Jing Xu (Citations: 1,729; h-index: 11)
E. Tan (Citations: 142; h-index: 5)
S. Dhuliawala (Citations: 1,713; h-index: 13)
Ping Yu (Citations: 510; h-index: 11)
Sainbayar Sukhbaatar (Citations: 217; h-index: 8)
J. Weston (Citations: 3,068; h-index: 25)
Olga Golovneva (Citations: 665; h-index: 11)

Original Abstract

Ensuring safety, factuality and overall quality in the generations of large language models is a critical challenge, especially as these models are increasingly deployed in real-world applications. The prevailing approach to addressing these issues involves collecting expensive, carefully curated datasets and applying multiple stages of fine-tuning and alignment. However, even this complex pipeline cannot guarantee the correction of patterns learned during pretraining. Therefore, addressing these issues during pretraining is crucial, as it shapes a model's core behaviors and prevents unsafe or hallucinated outputs from becoming deeply embedded. To tackle this issue, we introduce a new pretraining method that streams documents and uses reinforcement learning (RL) to improve the next K generated tokens at each step. A strong, post-trained model judges candidate generations -- including model rollouts, the original suffix, and a rewritten suffix -- for quality, safety, and factuality. Early in training, the process relies on the original and rewritten suffixes; as the model improves, RL rewards high-quality rollouts. This approach builds higher quality, safer, and more factual models from the ground up. In experiments, our method gives 36.2% and 18.5% relative improvements over standard pretraining in terms of factuality and safety, and up to 86.3% win rate improvements in overall generation quality.
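
The abstract describes the training step only at a high level, so the following is a minimal Python sketch of the candidate-judging idea, not the authors' implementation. It assumes a document is walked through in chunks of K tokens and that, at each step, a judge scores the original suffix, a rewritten suffix, and a model rollout. All names here (`Candidate`, `stream_chunks`, `judge_candidates`, `toy_judge`, `FLAGGED`) are hypothetical, the judge is a toy stand-in for the post-trained model, and the RL update itself is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Iterator, List, Sequence, Tuple


@dataclass
class Candidate:
    source: str          # "original", "rewrite", or "rollout"
    tokens: List[str]


def stream_chunks(tokens: Sequence[str], k: int) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (prefix, next-K suffix) pairs while walking through one document."""
    for start in range(0, len(tokens), k):
        yield list(tokens[:start]), list(tokens[start:start + k])


def judge_candidates(
    prefix: List[str],
    candidates: List[Candidate],
    judge: Callable[[List[str], List[str]], float],
) -> Tuple[Candidate, List[float]]:
    """Score every candidate with the judge; return the best one and all scores."""
    scores = [judge(prefix, c.tokens) for c in candidates]
    best_index = max(range(len(candidates)), key=lambda i: scores[i])  # ties: first wins
    return candidates[best_index], scores


# Assumption: a toy judge that penalises flagged tokens. The paper's judge is a
# strong post-trained model rating quality, safety, and factuality.
FLAGGED = {"UNSAFE", "FALSE"}


def toy_judge(prefix: List[str], suffix: List[str]) -> float:
    return -float(sum(tok in FLAGGED for tok in suffix))


def demo() -> List[str]:
    """Return which candidate source the toy judge prefers at each step."""
    document = "the sky is UNSAFE and FALSE today".split()
    chosen = []
    for prefix, original in stream_chunks(document, k=3):
        rewrite = [tok for tok in original if tok not in FLAGGED]  # toy rewriter
        rollout = list(original)  # toy policy that simply copies the data
        candidates = [
            Candidate("original", original),
            Candidate("rewrite", rewrite),
            Candidate("rollout", rollout),
        ]
        best, _ = judge_candidates(prefix, candidates, toy_judge)
        chosen.append(best.source)
    return chosen
```

In this toy run the rewritten suffix wins only on the chunk containing flagged tokens. In a real setup the per-candidate scores would presumably feed the RL reward for rollouts as the policy improves; how the paper turns judgments into rewards is not specified in the abstract.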

Citations: 2 (Influential: 0); Altmetric: 12.5; Score: 64.5
