2603.19223v1 Mar 19, 2026 cs.CL

F2LLM-v2: Inclusive, Performant, and Efficient Embeddings for a Multilingual World

Peng Di (Citations: 35; h-index: 4)

Ziyin Zhang (Citations: 358; h-index: 9)

Zihan Liao (Citations: 41; h-index: 4)

Han Yu (Citations: 49; h-index: 3)

Rui Wang (Citations: 19; h-index: 1)

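This paper introduces the F2LLM-v2 family of general-purpose multilingual embedding models, available in 8 sizes from 80M to 14B parameters. F2LLM-v2 is trained on 60 million publicly available, high-quality data samples and supports more than 200 languages, with particular emphasis on mid- and low-resource languages that were previously underserved. The authors combine a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation, producing models that are far more efficient than earlier LLM-based embedding models while remaining competitive in performance. In extensive evaluations, F2LLM-v2-14B ranks first on 11 MTEB benchmarks, and the smaller models in the family set a new state of the art for resource-constrained settings. To support open-source embedding model research, the authors release all models, data, code, and intermediate checkpoints.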

Original Abstract

We present F2LLM-v2, a new family of general-purpose, multilingual embedding models in 8 distinct sizes ranging from 80M to 14B. Trained on a newly curated composite of 60 million publicly available high-quality data samples, F2LLM-v2 supports more than 200 languages, with a particular emphasis on previously underserved mid- and low-resource languages. By integrating a two-stage LLM-based embedding training pipeline with matryoshka learning, model pruning, and knowledge distillation techniques, we present models that are far more efficient than previous LLM-based embedding models while retaining competitive performances. Extensive evaluations confirm that F2LLM-v2-14B ranks first on 11 MTEB benchmarks, while the smaller models in the family also set a new state of the art for resource-constrained applications. To facilitate open-source embedding model research, we release all models, data, code, and intermediate checkpoints.
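The abstract names matryoshka learning as one of the efficiency techniques. As a rough illustration of how matryoshka-style embeddings are typically consumed downstream, the sketch below keeps only the leading coordinates of a vector and re-normalizes before computing cosine similarity. The function names, the toy 8-dimensional vectors, and the choice of 4 retained dimensions are all assumptions made for illustration; the abstract does not state which dimensions F2LLM-v2 supports, and this is not the authors' code.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates, then re-normalize (hypothetical helper)."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    return l2_normalize(vec[:dim])


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


# Toy vectors standing in for real model outputs (values are made up).
query = [0.9, 0.1, 0.3, -0.2, 0.05, 0.4, -0.1, 0.2]
doc = [0.8, 0.2, 0.25, -0.1, -0.3, 0.1, 0.3, -0.2]

full_score = cosine_similarity(query, doc)
short_score = cosine_similarity(
    truncate_embedding(query, 4), truncate_embedding(doc, 4)
)
print(f"full: {full_score:.3f}  truncated to 4 dims: {short_score:.3f}")
```

With a model trained using a matryoshka objective, the truncated score is expected to stay close to the full one, which is what allows shorter vectors to be stored and searched at lower cost. The toy numbers here demonstrate only the mechanics, not that property.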

Citations: 0; Influential: 0; Altmetric: 4.5; Score: 22.5
