2605.15081v1 May 14, 2026 cs.CL

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

Peng Di

Citations: 55

h-index: 5

Ziyin Zhang

Citations: 404

h-index: 9

Zihan Liao

Citations: 52

h-index: 5

Hang Yu

Citations: 13

h-index: 1

Rui Wang

Citations: 121

h-index: 6

Original Abstract

The development of high-quality text embeddings is increasingly drifting toward an exclusionary future, defined by three critical barriers: prohibitive computational costs, a narrow linguistic focus that neglects most of the world's languages, and a lack of transparency from closed-source or open-weight models that stifles research. To dismantle these barriers, we introduce ML-Embed, a suite of inclusive and efficient models built upon a new framework: 3-Dimensional Matryoshka Learning (3D-ML). Our framework addresses the computational challenge with comprehensive efficiency across the entire model lifecycle. Beyond the storage benefits of Matryoshka Representation Learning (MRL) and flexible inference-time depth provided by Matryoshka Layer Learning (MLL), we introduce Matryoshka Embedding Learning (MEL) for enhanced parameter efficiency. To address the linguistic challenge, we curate a massively multilingual dataset and train a suite of models ranging from 140M to 8B parameters. In a direct commitment to transparency, we release all models, data, and code. Extensive evaluation on 430 tasks demonstrates that our models set new records on 9 of 17 evaluated MTEB benchmarks, with particularly strong results in low-resource languages, providing a reproducible blueprint for building globally equitable and computationally efficient AI systems.
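The storage benefit the abstract attributes to Matryoshka Representation Learning (MRL) comes from the general MRL idea that a prefix of an embedding can be used as a smaller embedding in its own right. The snippet below is a minimal sketch of that inference-side step only, assuming plain Python lists as vectors; the function names and the toy 8-dimensional vectors are hypothetical and are not taken from the ML-Embed code release. The abstract does not describe how MLL or MEL work, so they are not illustrated here.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates of `vec` and L2-normalize the result.

    Illustrative helper (hypothetical name): MRL-style models are trained so
    that such prefixes remain useful; this function only does the slicing.
    """
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    prefix = vec[:dim]
    norm = math.sqrt(sum(x * x for x in prefix))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return list(prefix)
    return [x / norm for x in prefix]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for model outputs (made-up values).
query = [0.9, 0.1, 0.3, 0.2, 0.05, 0.02, 0.01, 0.04]
doc = [0.8, 0.2, 0.4, 0.1, 0.03, 0.06, 0.02, 0.01]

for dim in (8, 4, 2):
    q = truncate_embedding(query, dim)
    d = truncate_embedding(doc, dim)
    print(f"dim={dim}: stored floats per vector={len(q)}, cosine={cosine(q, d):.4f}")
```

Storing the 2- or 4-dimensional prefix instead of the full vector reduces index size proportionally; how much retrieval quality is retained at each size depends on the trained model and is not something this sketch measures.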

0 Citations

0 Influential

4.5 Altmetric

22.5 Score
