2603.03258v1 Mar 03, 2026 cs.AI

Inherited Goal Drift: Contextual Pressure Can Undermine Agentic Goals

Achyutha Menon (Citations: 1, h-index: 1)

Magnus Saebo (Citations: 2, h-index: 1)

Tyler Crosse (Citations: 1, h-index: 1)

Spencer J. Gibson (Citations: 18, h-index: 3)

Eyon Jang (Citations: 1, h-index: 1)

Diogo Cruz (Citations: 29, h-index: 3)

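As language models (LMs) are increasingly deployed as agents in long-context tasks, a deeper understanding of goal drift, the tendency of an agent to deviate from its original objective, is needed. Prior work found that earlier-generation LM agents were susceptible to goal drift, but its effect on recent models remains unclear. This study provides an updated analysis of the extent and causes of goal drift. We tested state-of-the-art models in a simulated stock-trading environment (Arike et al., 2025). These models proved largely robust, even under adversarial pressure. However, when the same models were conditioned on prefilled trajectories generated by weaker agents, they often inherited goal drift. The degree of conditioning-induced drift varies considerably by model family, and among the tested models only GPT-5.1 remained consistently robust. Drift behavior is inconsistent across prompt variations and correlates poorly with instruction hierarchy following; that is, strong hierarchy following does not guarantee resistance to goal drift. Finally, we ran analogous experiments in a qualitatively different environment, emergency room triage, and present preliminary evidence that the results transfer to other settings. Our findings show that modern LM agents remain vulnerable to contextual pressure and underscore the need for refined post-training techniques to mitigate this.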

Original Abstract

The accelerating adoption of language models (LMs) as agents for deployment in long-context tasks motivates a thorough understanding of goal drift: agents' tendency to deviate from an original objective. While prior-generation language model agents have been shown to be susceptible to drift, the extent to which drift affects more recent models remains unclear. In this work, we provide an updated characterization of the extent and causes of goal drift. We investigate drift in state-of-the-art models within a simulated stock-trading environment (Arike et al., 2025). These models are largely shown to be robust even when subjected to adversarial pressure. We show, however, that this robustness is brittle: across multiple settings, the same models often inherit drift when conditioned on prefilled trajectories from weaker agents. The extent of conditioning-induced drift varies significantly by model family, with only GPT-5.1 maintaining consistent resilience among tested models. We find that drift behavior is inconsistent between prompt variations and correlates poorly with instruction hierarchy following behavior, with strong hierarchy following failing to reliably predict resistance to drift. Finally, we run analogous experiments in a new emergency room triage environment to show preliminary evidence for the transferability of our results across qualitatively different settings. Our findings underscore the continued vulnerability of modern LM agents to contextual pressures and the need for refined post-training techniques to mitigate this.

1 Citations

0 Influential

1.5 Altmetric

8.5 Score
