2605.07063v1 May 08, 2026 cs.LG

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

Pingbang Hu

Citations: 80

h-index: 4

Xueshen Liu

Citations: 58

h-index: 4

Z. Mao

Citations: 0

h-index: 0

Jiaqi Ma

Citations: 5

h-index: 1

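Data selection methods address an important challenge in LLM post-training: making effective use of limited but high-quality target data together with abundant general training data that is not perfectly aligned. This work goes beyond the data-selection framing and presents a new framework, Dr. Post-Training (Data-Regularized Post-Training). The framework reinterprets general training data not as a selection pool but as a data-induced regularizer that prevents overfitting to the scarce target objective. Concretely, at each training step it uses the general training data to construct a feasible set of model update directions and projects the update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods appear as special cases determined by the choice of data-induced regularizer, and they correspond to different points on a bias-variance spectrum with different regularization strengths. Building on this view, the authors propose a family of methods that offers a richer design space and a more flexible bias-variance tradeoff. For practical use at LLM scale, they introduce careful system optimizations so that these methods can be implemented with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that the proposed methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.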

Original Abstract

Data selection methods address a critical challenge in LLM post-training: effectively leveraging scarce, high-fidelity target data alongside abundant but imperfectly aligned general training data. In this work, we move beyond the data-selection framing and introduce Dr. Post-Training (Data-Regularized Post-Training), a novel framework that reconceptualizes general training data as a data-induced regularizer that prevents overfitting to the scarce target objective, rather than as a pool for selection. Specifically, at each training step, our framework constructs a feasible set of model update directions using the general training data and projects the model update direction specified by the scarce target data onto that feasible set. Standard training and existing data selection methods arise as special cases under different choices of the data-induced regularizer, and they correspond to different points on a bias-variance spectrum with different regularization strengths. Building on this view, we propose a family of methods offering a richer design space and more flexible bias-variance tradeoffs. For practical LLM-scale use, we introduce careful system optimizations that realize these methods with minimal overhead. Extensive experiments across SFT, RLHF, and RLVR show that our methods consistently outperform state-of-the-art data selection baselines, and system benchmarks confirm their efficiency.
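The "construct a feasible set, then project" step can be illustrated with a toy sketch. The abstract does not specify how the feasible set is built, so the example below makes an assumption purely for illustration: the feasible set is the half-space of update directions that do not conflict with a single averaged general-data gradient. The function names `dot` and `project_onto_halfspace` are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_onto_halfspace(
    target_grad: Sequence[float], general_grad: Sequence[float]
) -> List[float]:
    """Euclidean projection of target_grad onto {d : <d, general_grad> >= 0}.

    ASSUMPTION: this half-space stands in for the paper's feasible set,
    whose actual construction the abstract does not describe.
    """
    gg = dot(general_grad, general_grad)
    tg = dot(target_grad, general_grad)
    # Already feasible (or no constraint at all): keep the target direction.
    if gg == 0.0 or tg >= 0.0:
        return list(target_grad)
    # Otherwise remove the component that conflicts with the general gradient.
    scale = tg / gg
    return [t - scale * g for t, g in zip(target_grad, general_grad)]


# Target direction conflicts with the general direction on the second axis.
conflicting = project_onto_halfspace([1.0, -1.0], [0.0, 1.0])
# Target direction already agrees with the general direction.
agreeing = project_onto_halfspace([1.0, 1.0], [0.0, 1.0])
```

In this toy setting, a feasible set equal to the whole space leaves the target gradient untouched, which matches the abstract's remark that standard training is a special case. A tighter feasible set regularizes more strongly, trading variance for bias.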

0 Citations

0 Influential

2 Altmetric

10.0 Score
