2604.23290v1 Apr 25, 2026 cs.LG

An Analysis of Active Learning Algorithms using Real-World Crowd-sourced Text Annotations

Shayok Chakraborty

Citations: 3,967

h-index: 21

Yushun Dong

Citations: 189

h-index: 8

Varun Totakura

Citations: 12

h-index: 2

Ankita Singh

Citations: 260

h-index: 6

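Active learning algorithms automatically identify the most informative samples in large pools of unlabeled data, greatly reducing the human annotation effort needed to build machine learning models. Conventional active learning setups assume that the labeling oracle always provides the correct answer (class label), but this is hard to guarantee in real-world applications. Research has therefore addressed active learning algorithms for settings with imperfect or noisy oracles. Existing studies use machine learning models to simulate noisy oracles, but real-world conditions are far more complex, and simulating annotation patterns with ML models may not capture the subtleties of real annotation. This study first collects annotations for text samples from three benchmark text classification datasets through a crowd-sourcing platform. It then uses the collected annotations to run extensive experiments on eight widely used active learning techniques (together with deep neural networks). The analysis offers insight into how these techniques perform in real-world settings where annotators may provide incorrect labels or no label at all. We hope this study provides useful information for deploying deep active learning systems in real-world applications. The collected annotations are available at https://github.com/varuntotakura/al_rcta/.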

Original Abstract

Active learning algorithms automatically identify the most informative samples from large amounts of unlabeled data and substantially reduce the human annotation effort needed to induce a machine learning model. In a conventional active learning setup, the labeling oracles are assumed to be infallible; that is, they always provide correct answers (in terms of class labels) to the queried unlabeled instances. This cannot be guaranteed in real-world applications. To this end, a body of research has focused on the development of active learning algorithms in the presence of imperfect or noisy oracles. Existing research on active learning with noisy oracles typically simulates the oracles using machine learning models; however, real-world situations are much more challenging, and using ML models to simulate annotation patterns may not adequately capture the nuances of real-world annotation. In this research, we first collect annotations of text samples (from 3 benchmark text classification datasets) from workers on a crowd-sourcing platform. We then conduct extensive empirical studies of 8 commonly used active learning techniques (in conjunction with deep neural networks) using the obtained annotations. Our analyses shed light on the performance of these techniques under real-world challenges, where annotators can provide incorrect labels and can also refuse to provide labels. We hope this research will provide valuable insights for the deployment of deep active learning systems in real-world applications. The obtained annotations can be accessed at https://github.com/varuntotakura/al_rcta/.
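
The setting described in the abstract, where a queried annotator may answer correctly, answer incorrectly, or decline to answer, can be illustrated with a small sketch. The code below is not the authors' implementation. It assumes entropy-based uncertainty sampling as the query strategy (the abstract does not name the 8 techniques studied), and the function names, the `p_wrong` and `p_abstain` parameters, and the toy probabilities are all hypothetical. The simulated annotator is also exactly the kind of simplification the paper argues against, and stands in here only for real crowd-sourced answers.

```python
import math
import random


def entropy(probs):
    """Shannon entropy of a class-probability vector (higher means more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(pool_probs, batch_size):
    """Return the indices of the `batch_size` most uncertain pool samples."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:batch_size]


def noisy_oracle(true_label, num_classes, rng, p_wrong=0.2, p_abstain=0.1):
    """Hypothetical annotator: abstains (None), gives a wrong label, or answers correctly."""
    r = rng.random()
    if r < p_abstain:
        return None  # annotator refuses to provide a label
    if r < p_abstain + p_wrong:
        wrong = [c for c in range(num_classes) if c != true_label]
        return rng.choice(wrong)  # annotator provides an incorrect label
    return true_label


def query_round(pool_probs, true_labels, num_classes, batch_size, rng, **oracle_kwargs):
    """One query round: pick uncertain samples, ask the oracle, keep only answered ones."""
    labeled = {}
    for i in select_batch(pool_probs, batch_size):
        answer = noisy_oracle(true_labels[i], num_classes, rng, **oracle_kwargs)
        if answer is not None:
            labeled[i] = answer
    return labeled


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy model outputs for four unlabeled samples (assumed values).
    pool_probs = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4], [0.99, 0.01]]
    true_labels = [0, 1, 0, 0]
    print(query_round(pool_probs, true_labels, num_classes=2, batch_size=2, rng=rng))
```

In a full loop, the answered samples would be added to the training set, the model retrained, and the pool probabilities recomputed before the next round. Samples on which the annotator abstains simply stay in the pool in this sketch, which is one possible design choice among several.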

0 Citations

0 Influential

30.5 Altmetric

152.5 Score
