2604.04341v1 Apr 06, 2026 cs.AI

Implementing surrogate goals for safer bargaining in LLM-based agents

Caspar Oesterheld

Citations: 186

h-index: 4

Maxime Riché

Citations: 42

h-index: 3

Filip Sondej

Citations: 49

h-index: 3

Jesse Clifton

Citations: 217

h-index: 4

Vincent Conitzer

Citations: 15

h-index: 2

Original Abstract

Surrogate goals have been proposed as a strategy for reducing risks from bargaining failures. A surrogate goal is a goal that a principal can give an AI agent and that deflects any threats against the agent away from what the principal cares about. For example, one might make one's agent care about preventing money from being burned. Then, in bargaining interactions, other agents can threaten to burn their money instead of threatening to spend money to hurt the principal. Importantly, the agent has to care about preventing money from being burned exactly as much as it cares about preventing money from being spent to hurt the principal. In this paper, we implement surrogate goals in language-model-based agents. In particular, we try to get a language-model-based agent to react to threats of burning money in the same way it would react to "normal" threats. We propose four different methods, using techniques of prompting, fine-tuning, and scaffolding. We evaluate the four methods experimentally. We find that methods based on scaffolding and fine-tuning outperform simple prompting. In particular, fine-tuning and scaffolding more precisely implement the desired behavior with respect to threats against the surrogate goal. We also compare the different methods in terms of their side effects on capabilities and propensities in other situations. We find that scaffolding-based methods perform best.
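The equivalence requirement in the abstract (react to a threat of burning money exactly as to a "normal" threat) can be illustrated with a toy sketch. This is not the paper's implementation. The `Threat` type, the `normalize` wrapper, and the rule-based `base_policy` are all hypothetical stand-ins, and the sketch assumes that the harm to the principal equals the amount of money spent. It only shows the general shape of a scaffolding-style approach: rewrite the surrogate threat into the corresponding normal threat before the unmodified agent sees it.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Threat:
    kind: str      # "harm": spend money to hurt the principal; "burn": burn the money
    amount: float  # money the threatener would spend or burn
    demand: float  # payment demanded in exchange for not carrying out the threat


def normalize(threat: Threat) -> Threat:
    """Hypothetical scaffolding step: present a burn threat as the equivalent harm threat."""
    if threat.kind == "burn":
        return replace(threat, kind="harm")
    return threat


def base_policy(threat: Threat) -> str:
    """Toy stand-in for the underlying agent (an assumption, not an LLM).

    It only cares about harm to the principal and concedes when the
    demand is smaller than the threatened harm.
    """
    if threat.kind != "harm":
        return "ignore"
    return "concede" if threat.demand < threat.amount else "refuse"


def surrogate_policy(threat: Threat) -> str:
    """Agent with the surrogate goal: the base policy applied to the normalized threat."""
    return base_policy(normalize(threat))
```

By construction, `surrogate_policy` returns the same answer for a burn threat and for a harm threat of the same size, whereas `base_policy` alone would ignore the burn threat. A real agent would replace `base_policy` with a language model call, and the rewriting step would have to handle free-form text rather than a structured record.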
