2605.01874v1 May 03, 2026 cs.LG

Leveraging Data Symmetries to Select an Optimal Subset of Training Data under Label Noise

Xingjun Ma, David A. Clifton, Pavan Karjol, Kumar Shubham, A. Prathosh, Xinshao Wang, Yang Hua, E. Kodirov, N. Robertson, Imae, Yisen Wang, Zaiyi Chen, Yuan Luo, Jinfeng Yi, R. Winter, Marco Bertolini, T. Le, Frank Noé, Djork-Arné Clevert, Unsupervised

Abstract

The performance of machine learning models often relies on large labeled datasets; however, data collected from diverse sources can contain label noise. Recent work has shown that, in noisy settings, there may exist a subset of the training data on which models can achieve performance comparable to training on a noise-free dataset. A widely used method for identifying such subsets is cutstats, which employs k-nearest neighbors (k-NN) to detect low-noise samples. However, its performance on high-dimensional data remains largely unexplored. In this work, we formally establish that the performance of a classifier trained on a subset of a noisy dataset selected via cutstats is influenced by the accuracy of k-NN. We further demonstrate that, in noisy environments, exploiting data invariance and knowledge of underlying symmetries can significantly enhance the performance of k-NN, bringing it closer to the Bayes optimal classifier even in high-dimensional regimes. Finally, we show that for real-world scenarios, where information about the underlying invariance is only partially known, learnt invariant representations can still facilitate the identification of near-optimal subsets.
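To make the idea of k-NN-based subset selection concrete, the following is a minimal sketch under stated assumptions, not the cutstats procedure itself and not the authors' code. It swaps in a simpler neighbor-agreement score (the fraction of a sample's k nearest neighbors that share its noisy label) and keeps the highest-scoring samples. The toy data, the function names (`knn_agreement`, `select_subset`, `make_toy_data`, `run_demo`), and all parameter values are hypothetical. The toy labels depend only on the norm of each point, so the norm serves as a hand-picked rotation-invariant representation for comparison with the raw coordinates.

```python
import numpy as np


def knn_agreement(features, noisy_labels, k):
    """Fraction of each sample's k nearest neighbors sharing its noisy label."""
    features = np.asarray(features, dtype=float)
    sq = np.sum(features ** 2, axis=1)
    # Pairwise squared Euclidean distances via the usual expansion.
    d2 = sq[:, None] + sq[None, :] - 2.0 * (features @ features.T)
    np.fill_diagonal(d2, np.inf)  # a sample is not its own neighbor
    neighbors = np.argpartition(d2, k, axis=1)[:, :k]
    return (noisy_labels[neighbors] == noisy_labels[:, None]).mean(axis=1)


def select_subset(features, noisy_labels, keep_frac=0.5, k=50):
    """Indices of the keep_frac fraction of samples with highest agreement."""
    scores = knn_agreement(features, noisy_labels, k)
    n_keep = int(keep_frac * len(scores))
    return np.argsort(-scores, kind="stable")[:n_keep]


def make_toy_data(n_per_class=300, dim=20, noise_rate=0.3, seed=0):
    """Two classes on concentric shells; the label depends only on the norm."""
    rng = np.random.default_rng(seed)
    n = 2 * n_per_class
    true_labels = np.repeat([0, 1], n_per_class)
    directions = rng.normal(size=(n, dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    # Class 0 has radius near 1, class 1 has radius near 2.
    radii = 1.0 + true_labels + rng.uniform(-0.1, 0.1, size=n)
    x = directions * radii[:, None]
    # Symmetric label noise: each label is flipped with probability noise_rate.
    flip = rng.random(n) < noise_rate
    noisy_labels = np.where(flip, 1 - true_labels, true_labels)
    return x, true_labels, noisy_labels


def run_demo(seed=0):
    """Return the clean-label fraction of the subset chosen from each feature set."""
    x, true_labels, noisy_labels = make_toy_data(seed=seed)
    invariant = np.linalg.norm(x, axis=1, keepdims=True)  # rotation-invariant
    results = []
    for feats in (x, invariant):
        idx = select_subset(feats, noisy_labels)
        results.append(float(np.mean(noisy_labels[idx] == true_labels[idx])))
    return tuple(results)


precision_raw, precision_inv = run_demo()
print(f"clean fraction, raw features:       {precision_raw:.3f}")
print(f"clean fraction, invariant features: {precision_inv:.3f}")
```

In this constructed example the subset selected from the invariant feature should contain a noticeably larger share of correctly labeled samples than the subset selected from the raw 20-dimensional coordinates, because raw-space neighbors of the outer shell tend to come from the inner shell. This only illustrates the mechanism the abstract describes. It says nothing about the paper's actual experiments, and learned invariant representations would replace the hand-picked norm in practice.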
