2601.12549v1 Jan 18, 2026 cs.CL

Benchmarking Concept-Spilling Across Languages in LLMs

Ilia Badanin

Citations: 33

h-index: 1

Daniil Dzenhaliou

Citations: 93

h-index: 4

Imanol Schlag

Citations: 2,814

h-index: 16

Abstract

Multilingual Large Language Models (LLMs) exhibit remarkable cross-lingual abilities, yet often show a systematic bias toward representations from other languages, resulting in semantic interference when generating content in non-English languages, a phenomenon we define as language spilling. This paper presents a novel comparative framework for evaluating multilingual semantic robustness by systematically measuring how models handle polysemous words across languages. Our methodology provides a relative measure of model performance: when required to generate exactly five meanings, both strong and weak models may resort to meanings from dominant languages, but semantically stronger models do so later in the generation sequence, producing more true meanings from the target language before failing, while weaker models resort to dominant-language meanings earlier in the sequence. We evaluate a diverse set of open and closed multilingual LLMs using a structured meaning generation task across nine languages, employing a carefully curated benchmark of 100 high-polysemy English words. Our findings reveal significant variation in semantic robustness across both models and languages, providing a principled ranking system for model comparison without requiring definitive causal attribution of error sources. We contribute both a scalable comparative benchmark for multilingual semantic evaluation and a rigorous validation pipeline, both critical tools for developing more linguistically balanced AI systems.
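The relative measure described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not give the scoring formula, so the "mean position of the first spilled meaning" score below is an assumption, as are the function names and the input format (one list of five booleans per word, where `True` marks a meaning judged to come from a dominant language rather than the target language).

```python
from statistics import mean

# Assumption: each word yields exactly five generated meanings, as in the abstract.
MEANINGS_PER_WORD = 5


def first_spill_position(spilled_flags):
    """Return the 1-based position of the first spilled meaning.

    If no meaning is spilled, return len(spilled_flags) + 1, so that a
    model that never spills scores higher than one that spills last.
    """
    for position, is_spilled in enumerate(spilled_flags, start=1):
        if is_spilled:
            return position
    return len(spilled_flags) + 1


def mean_first_spill(per_word_flags):
    """Average the first-spill position over all benchmark words."""
    return mean(first_spill_position(flags) for flags in per_word_flags)


def rank_models(flags_by_model):
    """Order model names from most to least robust (later spill is better)."""
    scores = {name: mean_first_spill(flags) for name, flags in flags_by_model.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical judgments for two words and two models (illustrative data only).
example = {
    "strong": [[False, False, False, True, False], [False] * MEANINGS_PER_WORD],
    "weak": [[True, False, False, False, False], [False, True, False, False, False]],
}
ranking = rank_models(example)
```

Because the score depends only on where the first spill occurs, it yields a relative ordering of models without attributing each error to a specific cause, which matches the comparative framing of the abstract.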
