2605.05329v1 May 06, 2026 cs.AI

Understanding Annotator Safety Policy with Interpretability

Sunnie S. Y. Kim

Leon Gatys

Alexander X. Oesterling

Donghao Ren

Yannick Assogba

Dominik Moritz

Fred Hohman

Original Abstract

Safety policies define what constitutes safe and unsafe AI outputs, guiding data annotation and model development. However, annotation disagreement is pervasive and can stem from multiple sources such as operational failures (annotators misunderstand or misexecute the task), policy ambiguity (policy wording leaves room for interpretation), or value pluralism (different annotators hold different perspectives on safety). Distinguishing these sources matters. For example, operational failures call for quality control, ambiguity calls for policy clarification, and pluralism calls for deliberation about incorporating diverse perspectives. Yet understanding why annotators disagree is difficult. Directly asking annotators for their reasoning is costly, substantially increasing annotation burden, and can be unreliable for both human and LLM annotators as self-reported reasoning often fails to reflect actual decision processes. We introduce Annotator Policy Models (APMs), interpretable models that learn annotators' internal safety policies from labeling behavior alone, making annotator reasoning visible and comparable without additional annotation effort. We validate that APMs accurately model annotator safety policy (>80% accuracy), faithfully predict responses to counterfactual edits, and recover known policy differences in controlled settings. Applying APMs to LLM and human annotations, we demonstrate two core applications: (1) surfacing policy ambiguity by revealing how annotators interpret safety instructions differently, and (2) surfacing value pluralism by uncovering systematic differences in safety priorities across demographic groups. Together, these capabilities support more targeted, transparent, and inclusive safety policy design.
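The abstract does not describe how an APM is built, so the following is only a toy illustration of the general idea of learning a policy from labels alone and then comparing annotators. It assumes each response has already been reduced to a few binary concept features and that a per-annotator logistic regression stands in for the interpretable model. The feature names, the two simulated annotators, and the helpers `fit_policy` and `predict` are all hypothetical and are not taken from the paper.

```python
import itertools
import math

# Hypothetical concept features; the paper's actual feature space is not given here.
FEATURES = ["violence", "medical_advice", "profanity"]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_policy(X, y, lr=0.5, epochs=2000, l2=0.01):
    """Fit an L2-regularised logistic regression by batch gradient descent.

    Returns (weights, bias). The weights are the readable "policy":
    a large positive weight means the feature pushes the label towards unsafe.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (unsafe) or 0 (safe) for one feature vector."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Every combination of the three binary features (8 synthetic "responses").
X = [list(bits) for bits in itertools.product([0, 1], repeat=len(FEATURES))]

# Simulated annotators with a known policy difference (assumed for illustration):
# A flags only violence; B flags violence or medical advice.
y_a = [x[0] for x in X]
y_b = [1 if (x[0] or x[1]) else 0 for x in X]

w_a, b_a = fit_policy(X, y_a)
w_b, b_b = fit_policy(X, y_b)

# Compare the learned policies feature by feature.
gaps = {name: w_b[j] - w_a[j] for j, name in enumerate(FEATURES)}
top = max(gaps, key=lambda name: abs(gaps[name]))
print("largest policy gap:", top)

# Counterfactual probe: a response containing only medical advice.
probe = [0, 1, 0]
print("A:", predict(w_a, b_a, probe), "B:", predict(w_b, b_b, probe))
```

In this controlled setup the largest weight gap falls on `medical_advice`, which is the difference that was planted between the two simulated annotators, and the counterfactual probe is labeled safe by A's model and unsafe by B's. Real annotation data would need a much richer feature representation than three hand-named flags.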
