2604.08423v1 Apr 09, 2026 cs.CL

Synthetic Data Generation for Differentiable Targets

Synthetic Data for any Differentiable Target

Tatsunori Hashimoto

Citations: 285

h-index: 5

Tristan Thrush

Citations: 5,458

h-index: 18

Herman Brunborg

Citations: 42

h-index: 2

Luke Bailey

Citations: 320

h-index: 3

Marcel Roed

Citations: 356

h-index: 4

Neil Band

Citations: 157

h-index: 4

Christopher Potts

Citations: 102

h-index: 3

Sung Min Park

Citations: 833

h-index: 10

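How effectively can language models be controlled through synthetic training data? We develop Dataset Policy Gradient (DPG), a reinforcement learning (RL) method that precisely optimizes a synthetic data generator to produce a dataset of targeted examples. When these examples are used for supervised fine-tuning (SFT) of a target model, they improve the target model's performance on a differentiable metric of the user's choice. Our approach computes exact data attribution via higher-order gradients and uses those scores as policy gradient rewards. We prove that this procedure closely approximates the true gradient for the synthetic data generator. To illustrate DPG's potential, we show that SFT on generated examples alone can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We also show that the generator can be made to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither objective is stated in the generator's input prompts. These results suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.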

Original Abstract

What are the limits of controlling language models via synthetic training data? We develop a reinforcement learning (RL) primitive, the Dataset Policy Gradient (DPG), which can precisely optimize synthetic data generators to produce a dataset of targeted examples. When used for supervised fine-tuning (SFT) of a target model, these examples cause the target model to do well on a differentiable metric of our choice. Our approach achieves this by taking exact data attribution via higher-order gradients and using those scores as policy gradient rewards. We prove that this procedure closely approximates the true, intractable gradient for the synthetic data generator. To illustrate the potential of DPG, we show that, using only SFT on generated examples, we can cause the target model's LM head weights to (1) embed a QR code, (2) embed the pattern $\texttt{67}$, and (3) have lower $\ell^2$ norm. We additionally show that we can cause the generator to (4) rephrase inputs in a new language and (5) produce a specific UUID, even though neither of these objectives is conveyed in the generator's input prompts. These findings suggest that DPG is a powerful and flexible technique for shaping model properties using only synthetic training examples.
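
The loop the abstract describes (attribute a differentiable metric back to individual training examples, then use those scores as policy gradient rewards) can be illustrated on a toy problem. The sketch below is not the authors' implementation. It assumes a scalar linear model in place of the target LM, two full-batch SFT steps, a three-way categorical policy in place of the generator, and a reward defined as the negative derivative of the metric loss with respect to each example's training weight. All names (`CANDIDATES`, `sft_metric_and_attributions`, `train_generator`) are hypothetical.

```python
import math
import random

# Hypothetical candidate training examples (x, y) the toy "generator" can emit.
CANDIDATES = [(1.0, 2.0), (1.0, 0.0), (1.0, -1.0)]


def sft_metric_and_attributions(dataset, weights=None, w0=0.0, lr=0.1, w_target=2.0):
    """Run two SFT steps on a scalar model w with per-example loss
    0.5 * (w * x - y)**2, then return the metric loss 0.5 * (w2 - w_target)**2
    and its exact derivative with respect to each example's training weight."""
    if weights is None:
        weights = [1.0] * len(dataset)
    # SFT step 1 and the derivative of w1 w.r.t. each example weight.
    g0 = [(w0 * x - y) * x for x, y in dataset]
    w1 = w0 - lr * sum(e * g for e, g in zip(weights, g0))
    dw1 = [-lr * g for g in g0]
    # SFT step 2: differentiating through it brings in the Hessian of the
    # training loss, which is where the higher-order term appears.
    g1 = [(w1 * x - y) * x for x, y in dataset]
    hess = sum(e * x * x for e, (x, _) in zip(weights, dataset))
    w2 = w1 - lr * sum(e * g for e, g in zip(weights, g1))
    dw2 = [d * (1.0 - lr * hess) - lr * g for d, g in zip(dw1, g1)]
    metric = 0.5 * (w2 - w_target) ** 2
    attributions = [(w2 - w_target) * d for d in dw2]
    return metric, attributions


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_generator(steps=300, batch=8, policy_lr=1.0, seed=0):
    """REINFORCE on a categorical policy over CANDIDATES, with per-example
    rewards taken from the attribution scores (assumed: reward = -attribution)."""
    rng = random.Random(seed)
    logits = [0.0] * len(CANDIDATES)
    for _ in range(steps):
        probs = softmax(logits)
        picks = rng.choices(range(len(CANDIDATES)), weights=probs, k=batch)
        dataset = [CANDIDATES[i] for i in picks]
        _, attributions = sft_metric_and_attributions(dataset)
        rewards = [-a for a in attributions]  # lowering the metric loss is good
        baseline = sum(rewards) / batch       # mean baseline to reduce variance
        grad = [0.0] * len(logits)
        for i, r in zip(picks, rewards):
            for k in range(len(logits)):
                # Gradient of log softmax probability of pick i w.r.t. logit k.
                dlogp = (1.0 if k == i else 0.0) - probs[k]
                grad[k] += (r - baseline) * dlogp / batch
        logits = [v + policy_lr * g for v, g in zip(logits, grad)]
    return softmax(logits)


if __name__ == "__main__":
    print(train_generator())
```

In this toy setting the metric rewards a model weight near 2.0, so the policy should shift its probability mass toward the candidate `(1.0, 2.0)`, even though nothing in the policy's inputs states that goal. A real generator would be a language model and the attribution would require differentiating through many optimizer steps, which this sketch does not attempt.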

0 Citations

0 Influential

9 Altmetric

45.0 Score

