2603.04981v1 Mar 05, 2026 cs.AI

Rethinking Representativeness and Diversity in Dynamic Data Selection

Haiyun Guo

Citations: 1,641

h-index: 19

Zhenglin Hua

Citations: 23

h-index: 1

Yuheng Jia

Citations: 23

h-index: 1

Yuzhe Zhou

Citations: 27

h-index: 3

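Dynamic data selection speeds up training by sampling a changing subset of the dataset while preserving accuracy. In this work, we rethink two core notions of sample evaluation: representativeness and diversity. Instead of the conventional local geometric centrality, we define representativeness as how well a sample covers dataset-level common or high-frequency feature factors. Instead of the conventional within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over the course of training. Building on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space so that samples covering frequent factors are prioritized. To this end, we use a sparse autoencoder trained on the target dataset, whose sparse unit activations summarize both individual samples and dataset-wide feature statistics. Second, we realize process-level diversity by combining rare-factor sampling with a usage-frequency penalty, which promotes sample rotation, prevents monopoly, and reduces gradient bias. Third, a scheduler uses the two-dimensional score to transition smoothly from core-pattern consolidation to rare-factor exploration. This requires no extra gradients, influence estimates, or second-order computations on the training model. Extensive experiments on five benchmarks spanning vision and text tasks show improved accuracy-efficiency trade-offs across a range of models. Our method matches or exceeds full-data accuracy while speeding up training by more than 2x. Code will be released.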
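The first component, scoring samples by how well they cover frequent feature factors, can be illustrated with a minimal sketch. The abstract does not give the scoring formulas, so everything here is an assumption made for illustration: the function names, the rule that a sparse unit counts as "active" when its activation exceeds `eps`, and the choice of averaging dataset-level frequencies over a sample's active units. The activations stand in for the output of a sparse autoencoder, which is not trained here.

```python
from typing import List

def factor_frequencies(activations: List[List[float]], eps: float = 1e-8) -> List[float]:
    """Fraction of samples in which each sparse unit is active.

    Assumption: a unit is 'active' for a sample when its activation > eps.
    """
    n = len(activations)
    k = len(activations[0])
    return [sum(1 for row in activations if row[j] > eps) / n for j in range(k)]

def representativeness(row: List[float], freq: List[float], eps: float = 1e-8) -> float:
    """Hypothetical score: mean dataset-level frequency of the sample's active units."""
    active = [j for j, a in enumerate(row) if a > eps]
    if not active:
        return 0.0
    return sum(freq[j] for j in active) / len(active)

def rarity(row: List[float], freq: List[float], eps: float = 1e-8) -> float:
    """Hypothetical score: mean (1 - frequency) of the sample's active units."""
    active = [j for j, a in enumerate(row) if a > eps]
    if not active:
        return 0.0
    return sum(1.0 - freq[j] for j in active) / len(active)

# Toy sparse codes: 4 samples, 3 units. Unit 0 is common, unit 2 is rare.
acts = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]
freq = factor_frequencies(acts)
rep_scores = [representativeness(row, freq) for row in acts]
rare_scores = [rarity(row, freq) for row in acts]
```

In this toy data the first sample activates only the most common unit and so receives the highest representativeness, while the third sample, the only one that activates the rare unit, receives the highest rarity. Both scores depend only on the feature space, not on gradients of the model being trained.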

Original Abstract

Dynamic data selection accelerates training by sampling a changing subset of the dataset while preserving accuracy. We rethink two core notions underlying sample evaluation: representativeness and diversity. Instead of local geometric centrality, we define representativeness as coverage of dataset-level common or high-frequency feature factors. Instead of within-subset dispersion, we define diversity at the process level, requiring the selection trajectory to gradually include complementary rare factors over training. Based on this view, we propose a dynamic selection framework with three components. First, we score representativeness in a plug-in feature space to prioritize samples covering frequent factors. We instantiate this with a sparse autoencoder trained on the target dataset, using sparse unit activations to summarize both individual samples and dataset-wide factor statistics. Second, we realize process-level diversity by combining rare-factor sampling with a Usage-Frequency Penalty that promotes sample rotation, provably discourages monopoly, and reduces gradient bias. Third, we couple the two-dimensional scoring with a smooth scheduler that transitions selection from core-pattern consolidation to rare-factor exploration, without extra gradients, influence estimates, or second-order computations on the training model. Extensive experiments on five benchmarks across vision and text tasks demonstrate improved accuracy-efficiency trade-offs across models. Our method matches or exceeds full-data accuracy with over 2x training acceleration. Code will be released.
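The second and third components (a Usage-Frequency Penalty and a smooth scheduler) can be sketched as follows. This is not the authors' implementation: the abstract does not specify the schedule, the penalty form, or the sampling rule. The sigmoid schedule, the linear penalty on selection counts, the `sharpness` and `penalty` parameters, and the deterministic top-k selection (used here in place of the stochastic rare-factor sampling the abstract mentions, to keep the example reproducible) are all assumptions.

```python
import math
from typing import List

def schedule_weight(epoch: int, total_epochs: int, sharpness: float = 10.0) -> float:
    """Hypothetical smooth schedule: sigmoid weight rising from near 0 to near 1."""
    progress = epoch / max(total_epochs - 1, 1)
    return 1.0 / (1.0 + math.exp(-sharpness * (progress - 0.5)))

def select(
    rep: List[float],      # per-sample representativeness scores
    rare: List[float],     # per-sample rarity scores
    usage: List[int],      # how often each sample was selected so far (updated in place)
    epoch: int,
    total_epochs: int,
    budget: int,
    penalty: float = 0.1,  # assumed strength of the usage-frequency penalty
) -> List[int]:
    """Pick `budget` sample indices for this epoch and update usage counts."""
    w = schedule_weight(epoch, total_epochs)
    # Early epochs favor representativeness, late epochs favor rarity;
    # frequently selected samples are pushed down so the subset rotates.
    scores = [(1.0 - w) * r + w * q - penalty * u for r, q, u in zip(rep, rare, usage)]
    chosen = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:budget]
    for i in chosen:
        usage[i] += 1
    return chosen

rep = [0.9, 0.8, 0.2, 0.1]
rare = [0.1, 0.2, 0.8, 0.9]
usage = [0, 0, 0, 0]
early = select(rep, rare, usage, epoch=0, total_epochs=5, budget=2)
late = select(rep, rare, usage, epoch=4, total_epochs=5, budget=2)
```

With these toy scores, the first epoch selects the two most representative samples and the last epoch selects the two rarest. The selection uses only precomputed scores and counters, which matches the abstract's claim that no extra gradients, influence estimates, or second-order computations on the training model are needed.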

0 Citations

0 Influential

9.5 Altmetric

47.5 Score
