2603.25326v1 Mar 26, 2026 cs.AI

Evaluating Language Models for Harmful Manipulation

Laura Weidinger

Citations: 5,808

h-index: 19

Canfer Akbulut

Citations: 3,102

h-index: 8

R. Elasmar

Citations: 78

h-index: 3

Abhishek Roy

Citations: 36

h-index: 3

A. Payne

Citations: 17

h-index: 1

P. Suresh

Citations: 35

h-index: 2

Lujain Ibrahim

Citations: 77

h-index: 3

Seliem El-Sayed

Citations: 3,050

h-index: 5

Charvi Rastogi

Citations: 523

h-index: 12

Ashyana Kachra

Citations: 2,677

h-index: 1

Will Hawkins

Citations: 36

h-index: 2

Kristian Lum

Citations: 27

h-index: 2

Abstract

Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.

