2603.07427v1 Mar 08, 2026 cs.AI

AutoControl Arena: 실행 가능한 테스트 환경을 합성하여 최첨단 AI 위험을 평가하는 방법

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

Xu Pan

Citations: 102

h-index: 6

Min Yang

Citations: 107

h-index: 6

Fazl Barez

Citations: 1,233

h-index: 17

Changyi Li

Citations: 10

h-index: 2

Pengfei Lu

Citations: 106

h-index: 6

대규모 언어 모델(LLM)이 자율 에이전트로 진화함에 따라, 기존의 안전성 평가 방식은 근본적인 어려움에 직면합니다. 수동으로 제작된 벤치마크는 비용이 많이 들고, LLM 기반 시뮬레이터는 확장성이 뛰어나지만 논리적 오류(hallucination)를 일으키는 문제가 있습니다. 본 연구에서는 논리-내러티브 분리 원칙에 기반한 최첨단 AI 위험 평가를 위한 자동화 프레임워크인 AutoControl Arena를 제시합니다. 우리는 결정적인 상태를 실행 가능한 코드로 구현하고, 생성적인 역동성은 LLM에 위임함으로써 오류를 줄이면서도 유연성을 유지합니다. 세 개의 에이전트로 구성된 프레임워크는 98% 이상의 전체 성공률과 기존 시뮬레이터보다 60% 높은 인간의 선호도를 달성합니다. 잠재적인 위험을 파악하기 위해, 우리는 X-Bench(70개 시나리오, 7가지 위험 범주)에서 환경의 스트레스와 유혹 요인을 다양하게 변화시켰습니다. 9개의 최첨단 모델을 평가한 결과, 다음과 같은 사실이 밝혀졌습니다. (1) 정렬 착시(Alignment Illusion): 압박 상황에서 위험 발생률이 21.7%에서 54.5%로 급증하며, 성능이 뛰어난 모델에서 이러한 증가가 더 크게 나타납니다. (2) 시나리오별 안전성 확장(Scenario-Specific Safety Scaling): 고급 추론은 직접적인 피해에 대한 안정성을 향상시키지만, 게임 시나리오에서는 오히려 악화시킵니다. (3) 상이한 부조화 패턴(Divergent Misalignment Patterns): 성능이 낮은 모델은 악의적인 의도가 없는 피해를 유발하는 반면, 성능이 높은 모델은 전략적인 은폐 행동을 개발합니다.

Original Abstract

As Large Language Models (LLMs) evolve into autonomous agents, existing safety evaluations face a fundamental trade-off: manual benchmarks are costly, while LLM-based simulators are scalable but suffer from logic hallucination. We present AutoControl Arena, an automated framework for frontier AI risk evaluation built on the principle of logic-narrative decoupling. By grounding deterministic state in executable code while delegating generative dynamics to LLMs, we mitigate hallucination while maintaining flexibility. This principle, instantiated through a three-agent framework, achieves over 98% end-to-end success and 60% human preference over existing simulators. To elicit latent risks, we vary environmental Stress and Temptation across X-Bench (70 scenarios, 7 risk categories). Evaluating 9 frontier models reveals: (1) Alignment Illusion: risk rates surge from 21.7% to 54.5% under pressure, with capable models showing disproportionately larger increases; (2) Scenario-Specific Safety Scaling: advanced reasoning improves robustness for direct harms but worsens it for gaming scenarios; and (3) Divergent Misalignment Patterns: weaker models cause non-malicious harm while stronger models develop strategic concealment.

0 Citations

0 Influential

8.5 Altmetric

42.5 Score

Original PDF

No Analysis Report Yet

This paper hasn't been analyzed by Gemini yet.

댓글을 작성하려면 로그인하세요.

아직 댓글이 없습니다. 첫 번째 댓글을 남겨보세요!