2601.11516v4 Jan 16, 2026 cs.LG

Building Production-Ready Probes For Gemini

Arthur Conmy

Citations: 38

h-index: 4

Neel Nanda

Citations: 11,488

h-index: 36

János Kramár

Citations: 961

h-index: 7

Joshua Engels

Citations: 447

h-index: 9

Zheng Wang

Citations: 206

h-index: 7

Bilal Chughtai

Citations: 593

h-index: 11

Rohin Shah

Citations: 921

h-index: 9

Original Abstract

Frontier language model capabilities are improving rapidly. We thus need stronger mitigations against bad actors misusing increasingly powerful systems. Prior work has shown that activation probes may be a promising misuse mitigation technique, but we identify a key remaining challenge: probes fail to generalize under important production distribution shifts. In particular, we find that the shift from short-context to long-context inputs is difficult for existing probe architectures. We propose several new probe architectures that handle this long-context distribution shift. We evaluate these probes in the cyber-offensive domain, testing their robustness against various production-relevant distribution shifts, including multi-turn conversations, long context prompts, and adaptive red teaming. Our results demonstrate that while our novel architectures address context length, a combination of architecture choice and training on diverse distributions is required for broad generalization. Additionally, we show that pairing probes with prompted classifiers achieves optimal accuracy at a low cost due to the computational efficiency of probes. These findings have informed the successful deployment of misuse mitigation probes in user-facing instances of Gemini, Google's frontier language model. Finally, we find early positive results using AlphaEvolve to automate improvements in both probe architecture search and adaptive red teaming, showing that automating some AI safety research is already possible.
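One possible reading of "pairing probes with prompted classifiers" is a two-stage cascade: a cheap activation probe scores every input, and only uncertain cases are passed to the more expensive prompted classifier. The abstract does not specify the mechanism, so the sketch below is an illustration under stated assumptions rather than the paper's method. The function names, the mean-pooled linear probe, and the `low`/`high` thresholds are all hypothetical.

```python
import math
from typing import Callable, Sequence, Tuple

Vector = Sequence[float]


def linear_probe(token_activations: Sequence[Vector],
                 weights: Vector, bias: float) -> float:
    """Assumed probe: mean-pool activations over tokens, then apply
    a logistic-regression head. Returns a probability in (0, 1)."""
    n_tokens = len(token_activations)
    pooled = [
        sum(tok[i] for tok in token_activations) / n_tokens
        for i in range(len(weights))
    ]
    logit = sum(w * x for w, x in zip(weights, pooled)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


def cascade(probe_p: float, low: float, high: float,
            slow_classifier: Callable[[], bool]) -> Tuple[bool, bool]:
    """Hypothetical cascade. Returns (flagged, used_slow_classifier).

    Confident probe scores are decided immediately; only scores in
    [low, high] trigger the expensive prompted classifier."""
    if probe_p < low:
        return False, False
    if probe_p > high:
        return True, False
    return slow_classifier(), True


# Two tokens with two-dimensional activations (toy values).
p = linear_probe([[1.0, 0.0], [3.0, 0.0]], weights=[1.0, 0.0], bias=-2.0)
decision = cascade(p, low=0.2, high=0.8, slow_classifier=lambda: True)
```

In this toy setup the probe score falls between the two thresholds, so the slow classifier is consulted. In a cascade like this, the cost saving depends on how often the probe is confident enough to decide on its own.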

17 Citations

2 Influential

18 Altmetric

111.0 Score
