2604.18135v1 Apr 20, 2026 cs.CV

Soft Label Pruning and Quantization for Large-Scale Dataset Distillation

Lingao Xiao

Citations: 439

h-index: 5

Yang He

A*STAR - Agency for Science, Technology and Research

Citations: 3,485

h-index: 14

Abstract

Large-scale dataset distillation requires storing auxiliary soft labels that can be 30-40x larger on ImageNet-1K and 200x larger on ImageNet-21K than the condensed images, undermining the goal of dataset compression. We identify two fundamental issues necessitating such extensive labels: (1) insufficient image diversity, where high within-class similarity in synthetic images requires extensive augmentation, and (2) insufficient supervision diversity, where limited variety in supervisory signals during training leads to performance degradation at high compression rates. To address these challenges, we propose Label Pruning and Quantization for Large-scale Distillation (LPQLD). We enhance image diversity via class-wise batching and batch-normalization supervision during synthesis. For supervision diversity, we introduce Label Pruning with Dynamic Knowledge Reuse to improve label-per-augmentation diversity, and Label Quantization with Calibrated Student-Teacher Alignment to improve augmentation-per-image diversity. Our approach reduces soft label storage by 78x on ImageNet-1K and 500x on ImageNet-21K while improving accuracy by up to 7.2% and 2.8%, respectively. Extensive experiments validate the superiority of LPQLD across different network architectures and dataset distillation methods. Code is available at https://github.com/he-y/soft-label-pruning-quantization-for-dataset-distillation.
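To make the storage argument concrete, the following is a minimal sketch of why dense soft labels can outweigh the images they accompany, and how keeping fewer labels and storing them at lower precision shrinks that cost. It is not the LPQLD method: Dynamic Knowledge Reuse and Calibrated Student-Teacher Alignment are not modeled here. The function names, the uniform quantizer, and all numbers (images per class, augmentation count, bit widths) are hypothetical choices for illustration and are not taken from the paper.

```python
def soft_label_bytes(n_images, n_augs, n_classes, bits_per_value=16):
    """Bytes needed to store one dense soft label per (image, augmentation) pair."""
    return n_images * n_augs * n_classes * bits_per_value // 8


def quantize(probs, bits=8):
    """Uniformly quantize probabilities in [0, 1] to integer codes (assumed scheme)."""
    levels = (1 << bits) - 1
    return [round(p * levels) for p in probs]


def dequantize(codes, bits=8):
    """Map integer codes back to approximate probabilities."""
    levels = (1 << bits) - 1
    return [c / levels for c in codes]


# Hypothetical setting: 10 images per class, 1000 classes, 300 augmented views per image,
# each view storing a dense 16-bit soft label over all classes.
dense = soft_label_bytes(10_000, 300, 1000, bits_per_value=16)

# Illustrative reduction: keep labels for 1 in 10 views (pruning)
# and store 8-bit codes instead of 16-bit values (quantization).
reduced = soft_label_bytes(10_000, 30, 1000, bits_per_value=8)

ratio = dense / reduced

# Round-trip one soft label to see the precision cost of 8-bit storage.
label = [0.1, 0.2, 0.7]
restored = dequantize(quantize(label))
```

In this toy setting the two reductions multiply (10x from pruning, 2x from quantization). The abstract's point is that naive reductions of this kind reduce supervision diversity, which is what the proposed pruning and quantization schemes are designed to compensate for.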

