2604.13899v1 Apr 15, 2026 cs.CL

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Hinrich Schütze

Isabelle Augenstein

Ahmad Dawar Hakimi

Lea Hirlimann

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained classifiers over-predict the positive class relative to the human gold standard. This divergence concentrates in topically ambiguous discussions where the distinction between anti-immigrant hostility and policy critique is most subtle, suggesting that annotation strategy should be guided not by aggregate F1 alone but by the error profile acceptable for the target application.
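
The cost comparison in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below uses only the four figures quoted above (label counts and total costs) and assumes that each total can be divided evenly across its labels; the paper may account for costs differently, so the per-label values are illustrative rather than figures reported by the authors. The names `cost_per_label`, `llm_unit`, `human_unit`, and `ratio` are introduced here for illustration only.

```python
# Figures quoted in the abstract.
LLM_LABELS = 25_974       # GPT-5.2 labels
LLM_COST_USD = 43.0       # total cost of the LLM labels
HUMAN_LABELS = 3_800      # human annotations used for training
HUMAN_COST_USD = 316.0    # total cost of those human annotations


def cost_per_label(total_cost_usd: float, n_labels: int) -> float:
    """Average cost of one label, assuming the total is spread evenly."""
    return total_cost_usd / n_labels


llm_unit = cost_per_label(LLM_COST_USD, LLM_LABELS)
human_unit = cost_per_label(HUMAN_COST_USD, HUMAN_LABELS)

# How many times more expensive one human label is than one LLM label.
ratio = human_unit / llm_unit

print(f"LLM:   ${llm_unit:.4f} per label")
print(f"Human: ${human_unit:.4f} per label")
print(f"Human / LLM cost ratio: {ratio:.1f}x")
```

Under this even-split assumption, a human label costs roughly 50 times as much as an LLM label, which is why the abstract compares strategies at equal cost rather than at equal label count.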
