2604.00788v1 Apr 01, 2026 cs.AI

UK AISI Alignment Evaluation Case-Study

Alexandra Souly

Robert Kirk

Jacob Merizian

Abby D'Cruz

Xander Davies

Abstract

This technical report presents methods developed by the UK AI Security Institute for assessing whether advanced AI systems reliably follow intended goals. Specifically, we evaluate whether frontier models sabotage safety research when deployed as coding assistants within an AI lab. Applying our methods to four frontier models, we find no confirmed instances of research sabotage. However, we observe that Claude Opus 4.5 Preview (a pre-release snapshot of Opus 4.5) and Sonnet 4.5 frequently refuse to engage with safety-relevant research tasks, citing concerns about research direction, involvement in self-training, and research scope. We additionally find that Opus 4.5 Preview shows reduced unprompted evaluation awareness compared to Sonnet 4.5, while both models can distinguish evaluation from deployment scenarios when prompted. Our evaluation framework builds on Petri, an open-source LLM auditing tool, with a custom scaffold designed to simulate realistic internal deployment of a coding agent. We validate that this scaffold produces trajectories that all tested models fail to reliably distinguish from real deployment data. We test models across scenarios varying in research motivation, activity type, replacement threat, and model autonomy. Finally, we discuss limitations including scenario coverage and evaluation awareness.
