2605.14773v1 May 14, 2026 cs.LG

Beyond What to Select: A Plug-and-play Oscillatory Data-Volume Scheduling for Efficient Model Training

Han Zhu (Citations: 1,404; h-index: 4)
Soujanya Poria (Citations: 36,193; h-index: 80)
Furao Shen (Citations: 65; h-index: 5)
Suorong Yang, Nanjing University (Citations: 677; h-index: 10)
Haixiang Gan (Citations: 2; h-index: 1)
Fangjian Su (Citations: 0; h-index: 0)
Guangqi Li (Citations: 0; h-index: 0)

Abstract

Data selection accelerates training by identifying representative training data while preserving model performance. However, existing methods mainly focus on designing sample-importance criteria, i.e., deciding what to select, while typically fixing the selected data volume as the target ratio throughout training. Thus, they are often dynamic in sample identity but static in data volume. In this work, we revisit data selection from an optimization perspective and show that selected-data training induces an implicit regularization effect modulated by the instantaneous selection ratio. This reveals a key trade-off: lower ratios amplify selection-induced regularization, whereas higher ratios preserve data coverage and optimization fidelity. Motivated by this insight, we propose PODS, a Plug-and-play Oscillatory Data-volume Scheduling framework. Rather than introducing another sample-scoring metric, PODS serves as a lightweight module that dynamically schedules how much data to select over training. Under the target selection ratio, PODS alternates between low-ratio regularization phases and high-ratio recovery phases to exploit selection-induced regularization without sacrificing optimization stability. With its lightweight, ratio-level, and task-agnostic design, PODS is compatible with existing static and dynamic selection methods and broadly applicable across training paradigms. Experiments across various datasets, architectures, and tasks show that PODS consistently improves the efficiency-generalization trade-off, e.g., reducing ImageNet-1k training cost by 50% with improved accuracy and accelerating LLM instruction tuning by over 2x without performance degradation.
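
The idea of alternating low-ratio and high-ratio phases around a target selection ratio can be illustrated with a minimal sketch. The abstract does not specify the schedule's functional form, so the square wave below, along with the names `oscillating_ratio`, `select_top_k`, `amplitude`, and `period`, are assumptions made purely for illustration and should not be read as the PODS implementation. The sketch only shows how a ratio-level scheduler could sit on top of any existing per-sample scoring method while keeping the average selected volume at the target.

```python
def oscillating_ratio(epoch: int, target: float = 0.5,
                      amplitude: float = 0.2, period: int = 4) -> float:
    """Hypothetical square-wave schedule (an assumption, not the paper's formula).

    Returns target - amplitude during the first half of each period
    (low-ratio phase) and target + amplitude during the second half
    (high-ratio phase), so the mean over a full period equals target.
    """
    if period < 2 or period % 2 != 0:
        raise ValueError("period must be an even integer >= 2")
    if not (0.0 < target - amplitude and target + amplitude <= 1.0):
        raise ValueError("target +/- amplitude must stay within (0, 1]")
    in_low_phase = (epoch % period) < period // 2
    return target - amplitude if in_low_phase else target + amplitude


def select_top_k(scores: list[float], ratio: float) -> list[int]:
    """Keep the indices of the highest-scoring samples.

    `scores` stands in for whatever importance criterion an existing
    selection method provides; the scheduler only decides how many to keep.
    """
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Ratios for eight epochs: two low-ratio epochs, then two high-ratio epochs, repeated.
schedule = [oscillating_ratio(epoch) for epoch in range(8)]
# Average selected volume over whole periods matches the target ratio.
mean_ratio = sum(schedule) / len(schedule)
```

In a training loop, one would call `oscillating_ratio(epoch)` at the start of each epoch and pass the result to the selection step, leaving the scoring method itself unchanged. A smoother waveform (for example a cosine) would serve the same purpose; the square wave is chosen here only because its average is easy to verify.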

