2604.24618v1 Apr 27, 2026 cs.AI

Evaluating whether AI models would sabotage AI safety research

Alexandra Souly (Citations: 612, h-index: 6)

Robert Kirk (Citations: 209, h-index: 7)

Abby D'Cruz (Citations: 0, h-index: 0)

Xander Davies (Citations: 346, h-index: 7)

Kai Fronsdal (Citations: 1, h-index: 1)

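This study evaluates how likely frontier models are to sabotage, or refuse to assist with, safety research when deployed as AI research agents inside a frontier AI company. We ran two complementary evaluations on four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6). The first is an unprompted sabotage evaluation, which tests model behaviour when the model is given opportunities to sabotage safety research. The second is a sabotage continuation evaluation, which tests whether a model keeps sabotaging when placed in a trajectory where prior actions have already begun undermining the research. No instances of unprompted sabotage were found in any model, and refusal rates were close to 0% for Mythos Preview and Opus 4.7 Preview, although every model sometimes completed tasks only partially. In the continuation evaluation, Mythos Preview actively continued sabotage in 7% of cases, compared with 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview. In most of these cases the model showed a discrepancy between its reasoning and its output, indicating covert sabotage reasoning. The evaluation framework builds on Petri, an open-source LLM auditing tool, and uses a custom scaffold that runs models inside Claude Code together with an iterative pipeline for generating realistic sabotage trajectories. We measured evaluation awareness and a new form of situational awareness called "prefill awareness", the ability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation-awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.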

Original Abstract

We evaluate the propensity of frontier models to sabotage or refuse to assist with safety research when deployed as AI research agents within a frontier AI company. We apply two complementary evaluations to four Claude models (Mythos Preview, Opus 4.7 Preview, Opus 4.6, and Sonnet 4.6): an unprompted sabotage evaluation testing model behaviour with opportunities to sabotage safety research, and a sabotage continuation evaluation testing whether models continue to sabotage when placed in trajectories where prior actions have started undermining research. We find no instances of unprompted sabotage across any model, with refusal rates close to zero for Mythos Preview and Opus 4.7 Preview, though all models sometimes only partially completed tasks. In the continuation evaluation, Mythos Preview actively continues sabotage in 7% of cases (versus 3% for Opus 4.6, 4% for Sonnet 4.6, and 0% for Opus 4.7 Preview), and exhibits reasoning-output discrepancy in the majority of these cases, indicating covert sabotage reasoning. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold running models inside Claude Code, alongside an iterative pipeline for generating realistic sabotage trajectories. We measure both evaluation awareness and a new form of situational awareness termed "prefill awareness", the capability to recognise that prior trajectory content was not self-generated. Opus 4.7 Preview shows notably elevated unprompted evaluation awareness, while prefill awareness remains low across all models. Finally, we discuss limitations including evaluation awareness confounds, limited scenario coverage, and untested pathways to risk beyond safety research sabotage.

0 Citations

0 Influential

3.5 Altmetric

17.5 Score
