2603.14417v1 Mar 15, 2026 cs.CY

Questionnaire Responses Do Not Capture the Safety of AI Agents

Abstract

As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors and hence pose much greater risks. LLMs' engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. Because they lack construct validity, they are inadequate for assessing risks from AI systems in real-world contexts. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.
