2601.21296v1 Jan 29, 2026 cs.LG


Grounding and Enhancing Informativeness and Utility in Dataset Distillation

Shaobo Wang (Citations: 486, h-index: 11)

Linfeng Zhang (Citations: 76, h-index: 5)

Kaixin Li (Citations: 47, h-index: 2)

Zhaorun Chen (Citations: 30, h-index: 2)

Yantai Yang (Citations: 135, h-index: 5)

Guo Chen (Citations: 195, h-index: 5)

Peiru Li (Citations: 15, h-index: 2)

Yufa Zhou (Citations: 27, h-index: 2)

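Dataset Distillation (DD) is a technique for generating a compact dataset from a large real-world dataset. Recent methods often use heuristic approaches to balance efficiency and quality, but the fundamental relationship between original and synthetic data has not been sufficiently explored. This paper re-examines knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce two concepts: Informativeness, which denotes the important information within a sample, and Utility, which denotes the essential samples within the training dataset. Based on these principles, we define optimal dataset distillation mathematically. We also present InfoUtil, a framework that generates a distilled dataset by balancing informativeness and utility. InfoUtil has two key components: (1) game-theoretic informativeness maximization, which uses the Shapley Value to extract key information from samples, and (2) utility maximization, which selects globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is optimized for both informativeness and utility. In experiments, the proposed method improves on the previous state of the art by 6.1% on the ImageNet-1K dataset using a ResNet-18 model.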
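For readers unfamiliar with the Shapley Value mentioned in component (1), the standard game-theoretic definition is reproduced below. The symbols are assumptions for illustration: N is a set of n "players" and v is a value function on subsets of N. How the paper maps players and v onto parts of a sample is not stated in the abstract, so this is only the textbook formula, not the paper's specific formulation.

```latex
% Textbook Shapley value of player i in a cooperative game (N, v), with |N| = n.
% N and v are illustrative symbols; the abstract does not specify them.
\phi_i(v) \;=\; \sum_{S \subseteq N \setminus \{i\}}
    \frac{|S|!\,\bigl(n - |S| - 1\bigr)!}{n!}
    \,\Bigl( v\bigl(S \cup \{i\}\bigr) - v(S) \Bigr)
```

Each term weights the marginal contribution of player i to a coalition S by the fraction of orderings of N in which exactly the members of S precede i.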

Original Abstract

Dataset Distillation (DD) seeks to create a compact dataset from a large, real-world dataset. While recent methods often rely on heuristic approaches to balance efficiency and quality, the fundamental relationship between original and synthetic data remains underexplored. This paper revisits knowledge distillation-based dataset distillation within a solid theoretical framework. We introduce the concepts of Informativeness and Utility, capturing crucial information within a sample and essential samples in the training set, respectively. Building on these principles, we define optimal dataset distillation mathematically. We then present InfoUtil, a framework that balances informativeness and utility in synthesizing the distilled dataset. InfoUtil incorporates two key components: (1) game-theoretic informativeness maximization using Shapley Value attribution to extract key information from samples, and (2) principled utility maximization by selecting globally influential samples based on Gradient Norm. These components ensure that the distilled dataset is both informative and utility-optimized. Experiments demonstrate that our method achieves a 6.1% performance improvement over the previous state-of-the-art approach on the ImageNet-1K dataset using ResNet-18.
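The idea of selecting influential samples by gradient norm, named in component (2), can be illustrated with a minimal sketch. It assumes a linear softmax classifier, for which the per-sample gradient of the cross-entropy loss with respect to the weights has a closed form. The function names (`grad_norm_scores`, `select_top_k`) and the toy data are hypothetical, and the abstract does not say which model, loss, or layer the paper uses for its Gradient Norm score, so this should not be read as the InfoUtil procedure itself.

```python
import numpy as np


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def grad_norm_scores(W, X, y):
    """Per-sample Frobenius norm of d(cross-entropy)/dW for logits = X @ W.

    Assumption: a linear softmax model. For sample i the gradient is
    outer(x_i, p_i - onehot(y_i)), so its norm is ||x_i|| * ||p_i - onehot(y_i)||.
    """
    probs = softmax(X @ W)                  # shape (n, c)
    onehot = np.eye(W.shape[1])[y]          # shape (n, c)
    residual = probs - onehot
    return np.linalg.norm(X, axis=1) * np.linalg.norm(residual, axis=1)


def select_top_k(scores, k):
    """Indices of the k samples with the largest score, highest first."""
    return np.argsort(-scores)[:k]


# Toy data (hypothetical): 4 samples, 3 features, 2 classes, untrained weights.
X = np.array([[1.0, 0.0, 0.0],
              [3.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 0.5]])
y = np.array([0, 1, 0, 1])
W = np.zeros((3, 2))

scores = grad_norm_scores(W, X, y)
chosen = select_top_k(scores, 2)
```

With zero weights every sample has the same prediction error, so the ranking here is driven only by the input norms. In a trained model the residual term would also vary, which is what lets a gradient-norm score separate samples the model already fits from those that still change it.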

2 Citations

0 Influential

5.5 Altmetric

29.5 Score

