arXiv:2602.11089v1 [cs.CL], Feb 11, 2026

DataChef: Cooking Up Optimal Data Recipes for LLM Adaptation via Reinforcement Learning

Kai Chen (Citations: 443, h-index: 4)
Yicheng Chen (Citations: 85, h-index: 3)
Zerun Ma (Citations: 487, h-index: 5)
Xinchen Xie (Citations: 29, h-index: 1)
Yining Li (Citations: 39, h-index: 3)

Abstract

In the current landscape of Large Language Models (LLMs), curating large-scale, high-quality training data is a primary driver of model performance. A key lever is the data recipe: the data processing pipeline that transforms raw sources into training corpora. Although LLMs are increasingly used to automate individual data processing steps, such as data synthesis and filtering, the overall design of data recipes remains largely manual and labor-intensive, requiring substantial human expertise and iteration. To bridge this gap, we formulate end-to-end data recipe generation for LLM adaptation. Given a target benchmark and a pool of available data sources, a model is required to output a complete data recipe that adapts a base LLM to the target task. We present DataChef-32B, which performs online reinforcement learning using a proxy reward that predicts downstream performance for candidate recipes. Across six held-out tasks, DataChef-32B produces practical recipes that reach downstream performance comparable to that of recipes curated by human experts. Notably, the recipe from DataChef-32B adapts Qwen3-1.7B-Base to the math domain, achieving 66.7 on AIME'25 and surpassing Qwen3-1.7B. This work sheds new light on automating LLM training and developing self-evolving AI systems.
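To make the task formulation concrete, here is a minimal Python sketch of what a data recipe and a proxy-reward comparison between candidate recipes might look like. It is not the authors' implementation. Every name in it (`Recipe`, `drop_short`, `dedupe`, `proxy_reward`, `best_recipe`, the toy `pool`) is hypothetical. The keyword-based reward is only a stand-in for the learned proxy that, per the abstract, predicts downstream performance. The reinforcement learning loop that trains the recipe generator is not shown.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Example = Dict[str, str]                         # one training record
Step = Callable[[List[Example]], List[Example]]  # one processing step


@dataclass
class Recipe:
    """Hypothetical container: which sources to use and which steps to apply."""
    name: str
    sources: List[str]
    steps: List[Step]

    def run(self, pool: Dict[str, List[Example]]) -> List[Example]:
        # Concatenate the chosen sources, then apply each step in order.
        data = [ex for src in self.sources for ex in pool[src]]
        for step in self.steps:
            data = step(data)
        return data


def drop_short(min_chars: int) -> Step:
    """Filtering step: keep examples whose answer has at least min_chars characters."""
    return lambda data: [ex for ex in data if len(ex["answer"]) >= min_chars]


def dedupe(data: List[Example]) -> List[Example]:
    """Filtering step: drop repeated questions, keeping the first occurrence."""
    seen, out = set(), []
    for ex in data:
        if ex["question"] not in seen:
            seen.add(ex["question"])
            out.append(ex)
    return out


def proxy_reward(data: List[Example], keyword: str) -> float:
    """Toy stand-in (an assumption, not the paper's reward): the fraction of
    examples whose question mentions `keyword`. Empty corpora score 0.0."""
    if not data:
        return 0.0
    return sum(keyword in ex["question"] for ex in data) / len(data)


def best_recipe(candidates: List[Recipe], pool: Dict[str, List[Example]],
                keyword: str) -> Recipe:
    """Pick the candidate whose output corpus scores highest under the proxy."""
    return max(candidates, key=lambda r: proxy_reward(r.run(pool), keyword))


# Toy pool of available data sources (made-up records for illustration only).
pool = {
    "math_forum": [
        {"question": "Solve the equation x + 2 = 5", "answer": "x = 3 because 5 - 2 = 3"},
        {"question": "Solve the equation x + 2 = 5", "answer": "x = 3 because 5 - 2 = 3"},
        {"question": "Solve the equation 2x = 8", "answer": "4"},
    ],
    "web_chat": [
        {"question": "What is the capital of France?", "answer": "The capital is Paris."},
    ],
}

candidates = [
    Recipe("all_sources", ["math_forum", "web_chat"], [dedupe]),
    Recipe("math_only", ["math_forum"], [dedupe, drop_short(5)]),
]

chosen = best_recipe(candidates, pool, keyword="equation")
corpus = chosen.run(pool)
print(chosen.name, len(corpus))
```

In this sketch the candidates are hand-written and merely ranked. In the setting the abstract describes, a model generates the recipe itself, and the proxy reward serves as the training signal for online reinforcement learning.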
