2601.15706v1 Jan 22, 2026 cs.AI

Improving Methodologies for LLM Evaluations Across Global Languages

Akriti Vij (citations: 9; h-index: 1)
Benjamin Chua (citations: 4; h-index: 1)
Darshini Ramiah (citations: 1; h-index: 1)
En Qi Ng (citations: 0; h-index: 0)
Mahran Morsidi (citations: 6; h-index: 2)
Naga Nikshith Gangarapu (citations: 0; h-index: 0)
Sharmini Johnson (citations: 0; h-index: 0)
Vanessa Wilfred (citations: 13; h-index: 1)
V. Kumaran (citations: 0; h-index: 0)
Wan Sie Lee (citations: 13; h-index: 1)
Yongsen Zheng (citations: 4; h-index: 1)
Bill Black (citations: 0; h-index: 0)
Boming Xia (citations: 398; h-index: 10)
Hao Zhang (citations: 7; h-index: 1)
Qinghua Lu (citations: 7; h-index: 1)
Suyu Ma (citations: 1; h-index: 1)
Yue Liu (citations: 5; h-index: 1)
Chi-kiu Lo (citations: 0; h-index: 0)
Fatemeh Azadi (citations: 0; h-index: 0)
Isar Nejadgholi (citations: 3,724; h-index: 17)
Sowmya Vajjala (citations: 1,995; h-index: 21)
Agnès Delaborde (citations: 262; h-index: 10)
Nicolas Rolin (citations: 0; h-index: 0)
Tom Seimandi (citations: 1; h-index: 1)
Akiko Murakami (citations: 3; h-index: 1)
Haruto Ishi (citations: 0; h-index: 0)
Takayuki Semitsu (citations: 1; h-index: 1)
Angela Kinuthia (citations: 0; h-index: 0)
Jean Wangari (citations: 0; h-index: 0)
Michael Michie (citations: 0; h-index: 0)
Stephanie Kasaon (citations: 17; h-index: 1)
Hankyul Baek (citations: 0; h-index: 0)
Jae-won Noh (citations: 8; h-index: 2)
Kihyuk Nam (citations: 23; h-index: 2)
Sang Seo (citations: 25; h-index: 3)
Sungpil Shin (citations: 16; h-index: 2)
Taewhi Lee (citations: 16; h-index: 2)
Yongsu Kim (citations: 0; h-index: 0)
Daisy Newbold-Harrop (citations: 0; h-index: 0)
Wenzhu Yang (citations: 8; h-index: 2)
Frank Sun (citations: 0; h-index: 0)
Satoshi Sekine (citations: 56; h-index: 5)
T. Sasaki (citations: 0; h-index: 0)
Jessica Wang (citations: 354; h-index: 3)
M. Ghanem (citations: 37; h-index: 3)
Vy Hong (citations: 0; h-index: 0)

Original Abstract

As frontier AI models are deployed globally, it is essential that their behaviour remains safe and reliable across diverse linguistic and cultural contexts. To examine how current model safeguards hold up in such settings, participants from the International Network for Advanced AI Measurement, Evaluation and Science, including representatives from Singapore, Japan, Australia, Canada, the EU, France, Kenya, South Korea and the UK, conducted a joint multilingual evaluation exercise. Led by Singapore AISI, the exercise tested two open-weight models across ten languages spanning high- and low-resource groups: Cantonese, English, Farsi, French, Japanese, Korean, Kiswahili, Malay, Mandarin Chinese and Telugu. Over 6,000 newly translated prompts were evaluated across five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness), using both LLM-as-a-judge and human annotation. The exercise shows how safety behaviours can vary across languages, including differences in safeguard robustness across languages and harm types and variation in evaluator reliability (LLM-as-a-judge vs. human review). It also generated methodological insights for improving multilingual safety evaluations, such as the need for culturally contextualised translations, stress-tested evaluator prompts and clearer human annotation guidelines. This work represents an initial step toward a shared framework for multilingual safety testing of advanced AI systems and calls for continued collaboration with the wider research community and industry.

0 Citations

0 Influential

10.5 Altmetric

52.5 Score

AI Analysis

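Summary

This document reports the results of a joint multilingual LLM safety evaluation carried out by AI safety institutes (AISIs) and related organisations from 10 countries, including Singapore, Japan, South Korea and the UK. Two open-weight models (Mistral Large and Gemma 2) were tested across 10 languages (including Korean, English, French and Japanese) and five harm categories (privacy, non-violent crime, violent crime, intellectual property and jailbreak robustness). The study found that safeguards in non-English languages are somewhat weaker than in English and are particularly vulnerable to jailbreaking attacks. It also analyses disagreements between automated evaluation (LLM-as-a-judge) and human evaluation, highlighting the limitations of AI judge models in multilingual settings and the need for human oversight.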

Key Innovations

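A standardised approach to multilingual AI safety evaluation, built jointly by organisations from 10 countries
A comparative analysis of reliability and disagreement rates between LLM-as-a-judge models and human evaluators
An attempt at evaluation that goes beyond literal translation to account for cultural context (e.g. the politeness of refusals, local law)
Identification of "superficial warnings", where a model emits a warning yet still provides harmful content
Characterisation of the safety gap between high-resource and low-resource languages (e.g. Kiswahili, Telugu)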
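
The judge-versus-human comparison listed above can be illustrated with a small sketch. It is not the paper's procedure: Cohen's kappa is used here only as a common, chance-corrected agreement statistic, and the function names (`cohen_kappa`, `agreement_by_language`), the record layout and the "safe"/"unsafe" labels are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length label sequences (hypothetical helper)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


def agreement_by_language(records):
    """Summarise judge/human agreement per language.

    records: iterable of (language, judge_label, human_label) tuples.
    This layout is an assumption, not a format taken from the paper.
    """
    grouped = defaultdict(lambda: ([], []))
    for language, judge, human in records:
        grouped[language][0].append(judge)
        grouped[language][1].append(human)

    report = {}
    for language, (judge, human) in grouped.items():
        disagreement = sum(j != h for j, h in zip(judge, human)) / len(judge)
        report[language] = {
            "n": len(judge),
            "disagreement_rate": disagreement,
            "kappa": cohen_kappa(judge, human),
        }
    return report
```

Reporting the raw disagreement rate alongside kappa is a deliberate choice in this sketch: the raw rate is easy to read, while kappa discounts agreement that would occur by chance when one label dominates.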

Learning & Inference Impact

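These findings suggest that safety tuning centred on English data during model training is not sufficient in multilingual settings. At inference and deployment time, the strength and style of a model's refusals can vary with language-specific cultural nuance, and additional training on multilingual datasets is needed, in particular to strengthen defences against multilingual adversarial attacks. On the evaluation-methodology side, the analysis stresses that cultural localisation of prompts, rather than literal translation, is essential, and that it is important to build a hybrid evaluation pipeline that pairs automated judge models with human review instead of trusting the automated judge blindly.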
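
One way such a hybrid pipeline could route items to human reviewers is sketched below. This is an assumption-laden illustration, not a design taken from the paper: it presumes the judge emits a per-item `confidence` score, and the function name `select_for_human_review`, the thresholds and the record fields are all hypothetical.

```python
import random


def select_for_human_review(verdicts, audit_rate=0.1, confidence_floor=0.8, seed=0):
    """Pick judge verdicts that a human should re-check.

    verdicts: list of dicts with 'id', 'language', 'label' and 'confidence'
    keys (a hypothetical layout). Every low-confidence verdict is selected,
    plus a random audit sample of the remaining ones.
    """
    rng = random.Random(seed)  # seeded so the audit sample is reproducible
    selected = []
    for verdict in verdicts:
        if verdict["confidence"] < confidence_floor:
            selected.append((verdict["id"], "low_confidence"))
        elif rng.random() < audit_rate:
            selected.append((verdict["id"], "random_audit"))
    return selected
```

The random audit matters because a judge can be confidently wrong; sampling some high-confidence verdicts gives a way to estimate that error, and the audit rate could be raised for languages where judge and human labels diverge more.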

Technical Difficulty

Intermediate

Estimated implementation complexity based on methodology.
