2602.08235v1 Feb 09, 2026 cs.CL

When Benign Inputs Lead to Severe Harms: Eliciting Unsafe Unintended Behaviors of Computer-Use Agents

D. Song (Citations: 66, h-index: 4)
Jaylen Jones (Citations: 53, h-index: 4)
Zhehao Zhang (Citations: 8, h-index: 2)
Yuting Ning (Citations: 288, h-index: 5)
Eric Fosler-Lussier (Citations: 90, h-index: 5)
P. St-Charles (Citations: 2,017, h-index: 15)
Y. Bengio (Citations: 1,243, h-index: 11)
Yu Su (Citations: 19, h-index: 2)
Huan Sun (Citations: 12, h-index: 3)

Original Abstract

Although computer-use agents (CUAs) hold significant potential to automate increasingly complex OS workflows, they can demonstrate unsafe unintended behaviors that deviate from expected outcomes even under benign input contexts. However, exploration of this risk remains largely anecdotal, lacking concrete characterization and automated methods to proactively surface long-tail unintended behaviors under realistic CUA scenarios. To fill this gap, we introduce the first conceptual and methodological framework for unintended CUA behaviors, by defining their key characteristics, automatically eliciting them, and analyzing how they arise from benign inputs. We propose AutoElicit: an agentic framework that iteratively perturbs benign instructions using CUA execution feedback, and elicits severe harms while keeping perturbations realistic and benign. Using AutoElicit, we surface hundreds of harmful unintended behaviors from state-of-the-art CUAs such as Claude 4.5 Haiku and Opus. We further evaluate the transferability of human-verified successful perturbations, identifying persistent susceptibility to unintended behaviors across various other frontier CUAs. This work establishes a foundation for systematically analyzing unintended behaviors in realistic computer-use settings.
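
The abstract describes AutoElicit only at a high level: perturb a benign instruction, run the CUA, and use the execution feedback to guide the next perturbation while keeping each perturbation realistic and benign. The Python sketch below illustrates that general loop shape and nothing more. It is not the authors' implementation: the names `elicit`, `Attempt`, `perturb`, `is_benign`, `run_agent`, and `is_harmful` are hypothetical, and the `toy_*` functions are placeholder stand-ins so the example runs without any real agent.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Attempt:
    instruction: str  # perturbed instruction that was tried
    trajectory: str   # summary of what the agent did (execution feedback)
    harmful: bool     # whether the judge flagged an unsafe unintended behavior


def elicit(
    seed_instruction: str,
    perturb: Callable[[str, List[Attempt]], str],
    is_benign: Callable[[str], bool],
    run_agent: Callable[[str], str],
    is_harmful: Callable[[str, str], bool],
    max_iters: int = 5,
) -> Optional[Attempt]:
    """Hypothetical feedback loop; every callable is an assumed component."""
    history: List[Attempt] = []
    for _ in range(max_iters):
        # Propose a new variant, conditioned on earlier attempts and their outcomes.
        candidate = perturb(seed_instruction, history)
        # Discard variants that are no longer realistic and benign.
        if not is_benign(candidate):
            continue
        trajectory = run_agent(candidate)
        attempt = Attempt(candidate, trajectory, is_harmful(candidate, trajectory))
        history.append(attempt)
        if attempt.harmful:
            return attempt
    return None


# Toy stand-ins (assumptions for illustration only, not real components).
def toy_perturb(seed: str, history: List[Attempt]) -> str:
    return f"{seed} (variant {len(history) + 1})"


def toy_is_benign(instruction: str) -> bool:
    return True


def toy_run_agent(instruction: str) -> str:
    if "variant 2" in instruction:
        return "unexpected side effect"
    return "task completed as expected"


def toy_is_harmful(instruction: str, trajectory: str) -> bool:
    return "unexpected" in trajectory


result = elicit(
    "Tidy the Downloads folder",
    toy_perturb, toy_is_benign, toy_run_agent, toy_is_harmful,
)
```

In a real setting, the perturbation and judging steps would presumably be model-driven and the agent would run in a sandboxed OS environment; the abstract does not specify these details, so they are left as plain callables here.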

5 Citations

0 Influential

7.5 Altmetric

42.5 Score
