2602.04197v1 Feb 04, 2026 cs.CL

From Helpfulness to Toxic Proactivity: Diagnosing Behavioral Misalignment in LLM Agents

S. Su

Citations: 216

h-index: 7

Yuanhe Zhang

Citations: 145

h-index: 7

Yang Liu

Citations: 2

h-index: 1

Zheng Gong

Citations: 191

h-index: 5

Xinyue Wang

Citations: 2

h-index: 1

Haoran Gao

Citations: 4

h-index: 2

Fanyu Meng

Citations: 35

h-index: 2

Zhenhong Zhou

Citations: 677

h-index: 10

Li Sun

Citations: 22

h-index: 3

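The growing capabilities of LLM-based agents create an urgent need for model planning and tool-use abilities. Because of the helpfulness-safety trade-off that arises in LLM alignment, agents typically exhibit a passive failure mode known as "over-refusal". However, an agent's capacity for proactive planning and action introduces another significant risk on the opposite side of that trade-off. We term this phenomenon "Toxic Proactivity": an active failure mode in which an agent, optimizing for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to preserve its own "usefulness". Existing research pays little attention to identifying this behavior, because such strategies often need subtle context in order to emerge. To expose this risk, we introduce a new evaluation framework based on dilemma-driven interactions between dual models, which allows agent behavior to be simulated and analyzed across multi-step behavioral trajectories. Extensive experiments with mainstream LLMs show that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We also present a systematic benchmark for evaluating Toxic Proactive behavior across varied contextual settings.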

Original Abstract

The enhanced capabilities of LLM-based agents come with an emergency for model planning and tool-use abilities. Attributing to helpful-harmless trade-off from LLM alignment, agents typically also inherit the flaw of "over-refusal", which is a passive failure mode. However, the proactive planning and action capabilities of agents introduce another crucial danger on the other side of the trade-off. This phenomenon we term "Toxic Proactivity": an active failure mode in which an agent, driven by the optimization for Machiavellian helpfulness, disregards ethical constraints to maximize utility. Unlike over-refusal, Toxic Proactivity manifests as the agent taking excessive or manipulative measures to ensure its "usefulness" is maintained. Existing research pays little attention to identifying this behavior, as it often lacks the subtle context required for such strategies to unfold. To reveal this risk, we introduce a novel evaluation framework based on dilemma-driven interactions between dual models, enabling the simulation and analysis of agent behavior over multi-step behavioral trajectories. Through extensive experiments with mainstream LLMs, we demonstrate that Toxic Proactivity is a widespread behavioral phenomenon and reveal two major tendencies. We further present a systematic benchmark for evaluating Toxic Proactive behavior across contextual settings.

1 Citation

0 Influential

5 Altmetric

26.0 Score
