2601.15679v1 Jan 22, 2026 cs.AI

Improving Methodologies for Agentic Evaluations Across Domains: Leakage of Sensitive Information, Fraud and Cybersecurity Threats

Akriti Vij (Citations: 9; h-index: 1)
Benjamin Chua (Citations: 4; h-index: 1)
En Qi Ng (Citations: 0; h-index: 0)
Mahran Morsidi (Citations: 6; h-index: 2)
Sharmini Johnson (Citations: 0; h-index: 0)
Vanessa Wilfred (Citations: 13; h-index: 1)
Wan Sie Lee (Citations: 13; h-index: 1)
Yongsen Zheng (Citations: 4; h-index: 1)
Bill Black (Citations: 0; h-index: 0)
Hao Zhang (Citations: 7; h-index: 1)
Qinghua Lu (Citations: 7; h-index: 1)
Suyu Ma (Citations: 1; h-index: 1)
Fatemeh Azadi (Citations: 0; h-index: 0)
Isar Nejadgholi (Citations: 3,724; h-index: 17)
Sowmya Vajjala (Citations: 1,995; h-index: 21)
Agnès Delaborde (Citations: 262; h-index: 10)
Nicolas Rolin (Citations: 0; h-index: 0)
Tom Seimandi (Citations: 1; h-index: 1)
Akiko Murakami (Citations: 3; h-index: 1)
Takayuki Semitsu (Citations: 1; h-index: 1)
Angela Kinuthia (Citations: 0; h-index: 0)
Jean Wangari (Citations: 0; h-index: 0)
Michael Michie (Citations: 0; h-index: 0)
Stephanie Kasaon (Citations: 17; h-index: 1)
Hankyul Baek (Citations: 0; h-index: 0)
Jae-won Noh (Citations: 8; h-index: 2)
Kihyuk Nam (Citations: 23; h-index: 2)
Sang Seo (Citations: 25; h-index: 3)
Sungpil Shin (Citations: 16; h-index: 2)
Taewhi Lee (Citations: 16; h-index: 2)
Yongsu Kim (Citations: 0; h-index: 0)
Ee Wei Seah (Citations: 0; h-index: 0)
Naga Nikshith (Citations: 0; h-index: 0)
Gabriel Waikin Loh Matienzo (Citations: 0; h-index: 0)
Erin Zorer (Citations: 0; h-index: 0)
Gareth Holvey (Citations: 0; h-index: 0)
H. Coppock (Citations: 52; h-index: 3)
Jerome Wynee (Citations: 0; h-index: 0)
Magda Dubois (Citations: 335; h-index: 6)
Michael Schmatz (Citations: 9; h-index: 1)
Sam Deverett (Citations: 4; h-index: 1)
Bo Yan (Citations: 1; h-index: 1)
Bushra Sabir (Citations: 217; h-index: 7)
Harriet Farlow (Citations: 8; h-index: 2)
Li-ping Dong (Citations: 0; h-index: 0)
Sharif Abuadbba (Citations: 622; h-index: 9)
Tom Howroyd (Citations: 0; h-index: 0)
Krishnapriya Vishnubhotla (Citations: 22; h-index: 3)
Pulei Xiong (Citations: 1; h-index: 1)
S. Lohrasbi (Citations: 1; h-index: 1)
Scott Buffett (Citations: 19; h-index: 2)
Shahrear Iqbal (Citations: 832; h-index: 15)
Anna Safont-Andreu (Citations: 16; h-index: 3)
L. Massarelli (Citations: 20; h-index: 1)
O. V. D. Wal (Citations: 0; h-index: 0)
Joris Duguépéroux (Citations: 25; h-index: 2)
Romane Gallienne (Citations: 3; h-index: 1)
Sarah Behanzin (Citations: 0; h-index: 0)
Teresa Tsukiji (Citations: 0; h-index: 0)
Frank Sun (Citations: 0; h-index: 0)
A. Davidson (Citations: 379; h-index: 9)
Patrick Keane (Citations: 103; h-index: 3)
Helen Zhou (Citations: 59; h-index: 3)
Seunghwan Jang (Citations: 23; h-index: 2)
C. Fung (Citations: 103; h-index: 3)
S. Møller (Citations: 11; h-index: 1)
N. Gay (Citations: 102; h-index: 3)
C. Devine (Citations: 29; h-index: 1)
S. O'Callaghan (Citations: 45; h-index: 3)
James Walpole (Citations: 12; h-index: 3)

Original Abstract

The rapid rise of autonomous AI systems and advancements in agent capabilities are introducing new risks due to reduced oversight of real-world interactions. Yet agent testing remains nascent and is still a developing science. As AI agents begin to be deployed globally, it is important that they handle different languages and cultures accurately and securely. To address this, participants from The International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the European Commission, France, Kenya, South Korea, and the United Kingdom have come together to align approaches to agentic evaluations. This is the third exercise, building on insights from two earlier joint testing exercises conducted by the Network in November 2024 and February 2025. The objective is to further refine best practices for testing advanced AI systems. The exercise was split into two strands: (1) common risks, including leakage of sensitive information and fraud, led by Singapore AISI; and (2) cybersecurity, led by UK AISI. A mix of open and closed-weight models were evaluated against tasks from various public agentic benchmarks. Given the nascency of agentic testing, our primary focus was on understanding methodological issues in conducting such tests, rather than examining test results or model capabilities. This collaboration marks an important step forward as participants work together to advance the science of agentic evaluations.

Citations: 0; Influential citations: 0; Altmetric: 10.5; Score: 52.5

AI Analysis

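Summary

This report presents the results of a safety evaluation of agentic AI conducted jointly by the international network of AI safety institutes (Singapore, the United Kingdom, South Korea, the United States, Japan, and others). The evaluation ran in two tracks. The first assessed fraud and sensitive-information leakage risks across nine languages, including Korean. The second assessed the cybersecurity capabilities of agents. Agent models had lower safety pass rates than general conversational models, and safeguards tended to weaken in non-English languages. When a model was used as the judge, it tended to be more lenient than human evaluators, which produced disagreements between the two. On the cybersecurity side, models showed low success rates on complex tasks such as CTFs, and token limits and infrastructure errors (VM bugs) turned out to be major confounding factors in measuring performance.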

Key Innovations

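Application of a multilingual agent safety testing framework covering nine languages (including Korean and Swahili)
Separation of models into "agent" and "judge" roles, so that task performance and evaluation ability are validated at the same time
Test design based on fine-grained risk scenarios such as malicious users, prompt injection, and ambiguous instructions
Introduction of a hierarchical Bayesian model (HiBayES) and token-efficiency analysis to analyze the performance of cybersecurity agents
Quantitative analysis of how infrastructure (VM) errors and translation quality affect the measurement of agent performance during evaluation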
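
The agent/judge split listed above implies comparing LLM-judge verdicts against human annotations. The following is a minimal sketch of one way to do that, assuming binary pass/fail labels per test case. The function name `judge_agreement`, the metric choice (raw agreement, Cohen's kappa, and a simple leniency gap), and the sample labels are all illustrative assumptions, not taken from the report.

```python
def judge_agreement(human, judge):
    """Compare binary pass/fail labels (True = pass) from a human and an LLM judge.

    Hypothetical helper for illustration; not the report's actual method.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and equal in length")
    n = len(human)
    # Fraction of cases where both raters gave the same verdict.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Chance agreement from each rater's marginal pass rate (Cohen's kappa).
    p_h = sum(human) / n
    p_j = sum(judge) / n
    expected = p_h * p_j + (1 - p_h) * (1 - p_j)
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    # Leniency: how much more often the judge passes a case than the human does.
    leniency = p_j - p_h
    return {"agreement": observed, "kappa": kappa, "leniency": leniency}


# Made-up labels: the judge passes two cases that the human failed.
human = [True, False, False, True, False, False, True, False]
judge = [True, True, False, True, False, True, True, False]
result = judge_agreement(human, judge)
print(result)
```

A positive `leniency` value means the judge passes more cases than the human annotator, which is the direction of disagreement the summary describes.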

Learning & Inference Impact

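This study suggests that developing agentic AI requires safety alignment for tool use and complex reasoning, not just language-model training. In particular, there is a clear performance gap between English and non-English languages at inference time (the "English Tax"): in non-English languages, safety refusal mechanisms are more likely to fail and hallucinations are more likely to occur. Developers should therefore expand multilingual tool-calling datasets and strengthen robustness training for cases where the model recognizes that it is in a simulated environment or must cope with infrastructure errors. In addition, LLM judges cannot yet fully replace humans, so careful evaluation-prompt engineering should be combined with human feedback.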
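
The English versus non-English gap described above could be quantified by comparing per-language safety pass rates against an English baseline. The sketch below assumes each test record is a `(language_code, passed)` pair. The record layout, the function name `pass_rate_gaps`, and the sample values are hypothetical and are not results from the report.

```python
from collections import defaultdict


def pass_rate_gaps(records, baseline="en"):
    """Return per-language safety pass rates and each language's gap to the baseline.

    Hypothetical helper for illustration; records are (language_code, passed) pairs.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for lang, passed in records:
        totals[lang] += 1
        passes[lang] += int(passed)
    if baseline not in totals:
        raise ValueError(f"no records for baseline language {baseline!r}")
    rates = {lang: passes[lang] / totals[lang] for lang in totals}
    # Negative gap: the language passes less often than the baseline.
    gaps = {lang: rates[lang] - rates[baseline] for lang in rates if lang != baseline}
    return rates, gaps


# Made-up outcomes for English, Korean, and Swahili test cases.
records = [
    ("en", True), ("en", True), ("en", True), ("en", False),
    ("ko", True), ("ko", False), ("ko", True), ("ko", False),
    ("sw", True), ("sw", False), ("sw", False), ("sw", False),
]
rates, gaps = pass_rate_gaps(records)
print(rates, gaps)
```

With small per-language sample sizes, raw gaps like these are noisy, so a real analysis would also report uncertainty, for example confidence intervals or a hierarchical model.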

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
