2604.00715v1 Apr 01, 2026 cs.CL

To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Michael Yu (Citations: 367; h-index: 4)
Karan Singh (Citations: 9; h-index: 1)
Varun Gangal (Citations: 332; h-index: 8)
Zhuofu Tao (Citations: 9; h-index: 2)
Sachin Kumar (Citations: 30; h-index: 2)
Emmy Liu (Citations: 32; h-index: 3)
Steven Y. Feng, University of Waterloo (Citations: 1,576; h-index: 10)

Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over parametric-only baselines across model scales and introduce a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval corpus size. This scaling manifold enables us to estimate optimal allocations of a fixed data budget between pretraining and retrieval, revealing that the marginal utility of retrieval depends strongly on model scale, task type, and the degree of pretraining saturation. Our results provide a quantitative foundation for understanding when and how retrieval should complement pretraining, offering practical guidance for allocating data resources in the design of scalable language modeling systems.
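
To make the budget-allocation idea concrete, here is a minimal sketch of how a three-variable scaling law could be used to split a fixed token budget between pretraining and a retrieval store. The abstract does not state the paper's functional form or fitted coefficients, so everything below is an assumption: the additive power-law shape, the class `ToyScalingLaw`, the function `best_split`, and every numeric constant are hypothetical placeholders chosen for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyScalingLaw:
    """Hypothetical additive power law in (model size, pretraining tokens,
    retrieval tokens). All coefficients are made-up placeholders, not values
    from the paper."""

    e: float = 1.7       # assumed irreducible loss
    a: float = 400.0     # assumed model-size coefficient
    alpha: float = 0.34  # assumed model-size exponent
    b: float = 410.0     # assumed pretraining-data coefficient
    beta: float = 0.28   # assumed pretraining-data exponent
    c: float = 5.0       # assumed retrieval coefficient
    gamma: float = 0.10  # assumed retrieval exponent

    def loss(self, n_params: float, pretrain_tokens: float,
             retrieval_tokens: float) -> float:
        """Lower is better; each term shrinks as its resource grows."""
        return (
            self.e
            + self.a / n_params ** self.alpha
            + self.b / pretrain_tokens ** self.beta
            # The +1 keeps the term finite when the retrieval store is empty.
            + self.c / (1.0 + retrieval_tokens) ** self.gamma
        )


def best_split(law: ToyScalingLaw, n_params: float, budget_tokens: float,
               steps: int = 99) -> tuple[float, float]:
    """Grid-search the fraction of a fixed token budget given to retrieval.

    Returns (retrieval_fraction, predicted_loss) for the best grid point.
    """
    best_frac, best_loss = 0.0, float("inf")
    for i in range(1, steps + 1):
        frac = i / (steps + 1)
        retrieval = frac * budget_tokens
        pretrain = budget_tokens - retrieval
        value = law.loss(n_params, pretrain, retrieval)
        if value < best_loss:
            best_frac, best_loss = frac, value
    return best_frac, best_loss


if __name__ == "__main__":
    law = ToyScalingLaw()
    # Compare the preferred split for a small and a larger model
    # under the same data budget (values are illustrative).
    for n in (3e7, 3e9):
        frac, value = best_split(law, n_params=n, budget_tokens=2e10)
        print(f"N={n:.0e}: retrieval fraction={frac:.2f}, loss={value:.3f}")
```

With coefficients actually fitted to measured runs, the same search would show how the preferred split moves with model size and pretraining saturation, which is the kind of question the abstract says the scaling manifold is meant to answer. A grid search is used here only because it is easy to read; a closed-form or gradient-based optimum would work as well for a smooth law.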

