2601.08951v2 Jan 13, 2026 cs.CY

PluriHarms: Benchmarking the Full Spectrum of Human Judgments on AI Harm

Jing-Jing Li

Citations: 21

h-index: 3

Joel Mire

Citations: 81

h-index: 4

Eve Fleisig

Citations: 54

h-index: 3

V. Pyatkin

Citations: 2

h-index: 1

Anne G. E. Collins

Citations: 92

h-index: 5

Maarten Sap

Carnegie Mellon University, Allen Institute for AI

Citations: 15,585

h-index: 52

Sydney Levine

Citations: 55

h-index: 5

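Most current AI safety frameworks treat harmfulness as binary and lack the flexibility to handle borderline cases where humans meaningfully disagree. Building more pluralistic systems requires understanding where and why disagreements arise, rather than relying on consensus alone. This work introduces PluriHarms, a benchmark designed to study human harm judgments systematically along two key dimensions: the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Its scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, as validated by human data. The benchmark contains 150 prompts and 15,000 ratings from 100 human annotators, and it additionally includes the annotators' demographic and psychological traits along with prompt-level features of harmful actions, effects, and values. The analyses show that prompts relating to imminent risks and tangible harms amplify perceived harmfulness, and that annotator traits (e.g., toxicity experience, education level) and their interactions with prompt content explain systematic disagreement. Benchmarking AI safety models and alignment methods on PluriHarms shows that personalization significantly improves prediction of human harm judgments, though considerable room for further progress remains. By explicitly targeting value diversity and disagreement, the work provides a foundation for moving beyond "one-size-fits-all" safety approaches toward pluralistically safe AI.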

Original Abstract

Current AI safety frameworks, which often treat harmfulness as binary, lack the flexibility to handle borderline cases where humans meaningfully disagree. To build more pluralistic systems, it is essential to move beyond consensus and instead understand where and why disagreements arise. We introduce PluriHarms, a benchmark designed to systematically study human harm judgments across two key dimensions -- the harm axis (benign to harmful) and the agreement axis (agreement to disagreement). Our scalable framework generates prompts that capture diverse AI harms and human values while targeting cases with high disagreement rates, validated by human data. The benchmark includes 150 prompts with 15,000 ratings from 100 human annotators, enriched with demographic and psychological traits and prompt-level features of harmful actions, effects, and values. Our analyses show that prompts that relate to imminent risks and tangible harms amplify perceived harmfulness, while annotator traits (e.g., toxicity experience, education) and their interactions with prompt content explain systematic disagreement. We benchmark AI safety models and alignment methods on PluriHarms, finding that while personalization significantly improves prediction of human harm judgments, considerable room remains for future progress. By explicitly targeting value diversity and disagreement, our work provides a principled benchmark for moving beyond "one-size-fits-all" safety toward pluralistically safe AI.
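The two axes described in the abstract can be illustrated with a small sketch. This is not the authors' code or scoring method: it assumes, purely for illustration, that each rating is a number in [0, 1], that a prompt's position on the harm axis is its mean rating, and that its position on the agreement axis is the standard deviation of its ratings. The ratings are synthetic placeholders, and the names `synthetic_ratings` and `prompt_axes` are hypothetical. Only the counts (150 prompts, 100 annotators, 15,000 ratings) come from the abstract.

```python
import random
from statistics import mean, pstdev

N_PROMPTS = 150      # stated in the abstract
N_ANNOTATORS = 100   # stated in the abstract


def synthetic_ratings(n_prompts, n_annotators, seed=0):
    """Placeholder data: one rating per (prompt, annotator) pair.

    The [0, 1] scale is an assumption made for this sketch,
    not the scale used in the benchmark.
    """
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_annotators)]
            for _ in range(n_prompts)]


def prompt_axes(ratings):
    """Return one (harm, disagreement) pair per prompt.

    harm         = mean rating (assumed proxy for the harm axis)
    disagreement = population standard deviation of the ratings
                   (assumed proxy for the agreement axis)
    """
    return [(mean(row), pstdev(row)) for row in ratings]


ratings = synthetic_ratings(N_PROMPTS, N_ANNOTATORS)
axes = prompt_axes(ratings)

# 150 prompts x 100 annotators gives the 15,000 ratings in the abstract.
total_ratings = sum(len(row) for row in ratings)
print(total_ratings)  # 15000

# Index of the prompt with the widest spread of ratings.
most_contested = max(range(N_PROMPTS), key=lambda i: axes[i][1])
print(most_contested, axes[most_contested])
```

Under these assumptions, a prompt that every annotator rates identically has a disagreement of 0, while a prompt that splits annotators evenly between the two extremes has the maximum spread. Other disagreement measures (for example, entropy over discrete rating levels) would fit the same structure.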

1 Citation

1 Influential

26 Altmetric

133.0 Score
