2604.13977v1 Apr 15, 2026 cs.CL

How Can We Synthesize High-Quality Pretraining Data? A Systematic Study of Prompt Design, Generator Model, and Source Data

E. Beeching

Citations: 1,068

h-index: 7

Joel Niklaus

University of Bern, University of Fribourg

Citations: 926

h-index: 13

Atsuki Yamaguchi

University of Sheffield

Citations: 255

h-index: 8

Michal Štefánik

Citations: 36

h-index: 4

Guilherme Penedo

Hugging Face

Citations: 2,942

h-index: 9

Hynek Kydlíček

Citations: 240

h-index: 3

Elie Bakouch

Citations: 548

h-index: 6

Lewis Tunstall

Citations: 2,202

h-index: 9

Thibaud Frere

Citations: 22

h-index: 1

Colin Raffel

Citations: 2,508

h-index: 17

L. V. Werra

Citations: 8,439

h-index: 19

Thomas Wolf

Citations: 1,322

h-index: 6

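Synthetic data is an essential component of large language model training, yet systematic comparisons across design dimensions such as rephrasing strategy, generator model, and source data are lacking. This study runs extensive controlled experiments, generating over one trillion tokens, to identify which factors matter when rephrasing web text into synthetic pretraining data. Structured output formats such as tables, math problems, FAQs, and tutorials consistently outperform both curated web data baselines and existing synthetic methods. Notably, scaling the generator model beyond 1 billion parameters yields no additional gain. The choice of original data also has a large effect on performance. Building on these findings, the authors develop FinePhrase, a 486-billion-token open dataset of rephrased web text. FinePhrase outperforms all existing synthetic data baselines while cutting generation cost by up to 30 times. The dataset, all prompts, and the generation framework are released to the research community.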

Original Abstract

Synthetic data is a standard component in training large language models, yet systematic comparisons across design dimensions, including rephrasing strategy, generator model, and source data, remain absent. We conduct extensive controlled experiments, generating over one trillion tokens, to identify critical factors in rephrasing web text into synthetic pretraining data. Our results reveal that structured output formats, such as tables, math problems, FAQs, and tutorials, consistently outperform both curated web baselines and prior synthetic methods. Notably, increasing the size of the generator model beyond 1B parameters provides no additional benefit. Our analysis also demonstrates that the selection of the original data used for mixing substantially influences performance. By applying our findings, we develop FinePhrase, a 486-billion-token open dataset of rephrased web text. We show that FinePhrase outperforms all existing synthetic data baselines while reducing generation costs by up to 30 times. We provide the dataset, all prompts, and the generation framework to the research community.
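
To make the idea of rephrasing web text into a structured output format concrete, here is a minimal Python sketch of how a rephrasing prompt might be assembled for the four formats the abstract names. Everything in it is an assumption for illustration: the instruction strings, the `FORMAT_INSTRUCTIONS` mapping, and the `build_rephrase_prompt` helper are hypothetical and do not reproduce the prompts or the generation framework released with the paper. No generator model is called; the sketch only builds the text that would be sent to one.

```python
# Hypothetical instruction templates, one per structured output format
# mentioned in the abstract. The paper's real prompts are NOT reproduced here.
FORMAT_INSTRUCTIONS = {
    "table": "Rewrite the key facts of the document below as a table.",
    "math": "Write math problems with solutions based on the document below.",
    "faq": "Rewrite the document below as a list of questions and answers.",
    "tutorial": "Rewrite the document below as a step-by-step tutorial.",
}


def build_rephrase_prompt(document: str, output_format: str) -> str:
    """Combine a format instruction with a source web document.

    Raises ValueError for a format that has no template.
    """
    if output_format not in FORMAT_INSTRUCTIONS:
        known = ", ".join(sorted(FORMAT_INSTRUCTIONS))
        raise ValueError(f"unknown format {output_format!r}; expected one of: {known}")
    instruction = FORMAT_INSTRUCTIONS[output_format]
    # Strip surrounding whitespace so scraped text does not leave stray blank lines.
    return (
        f"{instruction}\n\n"
        f"Source document:\n{document.strip()}\n\n"
        f"Rewritten output:"
    )


if __name__ == "__main__":
    # Example usage with a made-up one-sentence source document.
    print(build_rephrase_prompt("Water boils at 100 C at sea level.", "faq"))
```

Keeping the format instruction separate from the source document is one way to vary the rephrasing strategy while holding the generator model and source data fixed, which is the kind of controlled comparison the abstract describes.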

0 Citations

0 Influential

9.5 Altmetric

47.5 Score
