2604.10590v1 Apr 12, 2026 cs.CL

Bridging Linguistic Gaps: Cross-Lingual Mapping in Pre-Training and Dataset for Enhanced Multilingual LLM Performance

AiTi Aw

Weihua Zheng

Chang Liu

Zhengyuan Liu

Xin Huang

Kui Wu

Muhammad Huzaifah

Roy Ka-wei Lee

Original Abstract

Multilingual Large Language Models (LLMs) struggle with cross-lingual tasks due to data imbalances between high-resource and low-resource languages, as well as monolingual bias in pre-training. Existing methods, such as bilingual fine-tuning and contrastive alignment, can improve cross-lingual performance, but they often require extensive parallel data or suffer from instability. To address these challenges, we introduce a Cross-Lingual Mapping Task during the pre-training phase, which enhances cross-lingual alignment without compromising monolingual fluency. Our approach bi-directionally maps languages within the LLM embedding space, improving both language generation and comprehension. We further propose a Language Alignment Coefficient to robustly quantify cross-lingual consistency, even in limited-data scenarios. Experimental results on machine translation (MT), cross-lingual natural language understanding (CLNLU), and cross-lingual question answering (CLQA) show that our model achieves gains of up to 11.9 BLEU points in MT, 6.72 points in CLQA BERTScore-Precision, and more than 5% in CLNLU accuracy over strong multilingual baselines. These findings highlight the potential of incorporating cross-lingual objectives into pre-training to improve multilingual LLMs.

