2606.13647v1 Jun 11, 2026 cs.CL

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa
Citations: 185
h-index: 3
Andrej Ridzik
Citations: 17
h-index: 2
D. Hládek
Citations: 580
h-index: 12
Natália Kňažeková
Citations: 0
h-index: 0
Viktoria Ondrejova
Citations: 27
h-index: 4

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language. It comprises 31 datasets across 7 task types, nearly 4× the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M parameters) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models perform competitively with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.
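To make the vocabulary trimming idea concrete, here is a minimal toy sketch in plain Python. It is not the authors' pipeline: the function name `trim_vocabulary`, the tiny vocabulary, and the two-dimensional embedding rows are all hypothetical and chosen only for illustration. The sketch assumes trimming means keeping the tokens that occur in a target-language corpus (plus special tokens), reassigning contiguous ids, and copying the matching rows of the embedding table.

```python
from typing import Dict, List, Tuple


def trim_vocabulary(
    vocab: Dict[str, int],
    embeddings: List[List[float]],
    corpus_tokens: List[List[str]],
    always_keep: Tuple[str, ...] = ("<pad>", "<unk>"),
) -> Tuple[Dict[str, int], List[List[float]]]:
    """Keep only tokens seen in the corpus (plus special tokens) and
    rebuild the embedding table with contiguous new ids."""
    # Collect every token that actually occurs in the target-language corpus.
    seen = {tok for sentence in corpus_tokens for tok in sentence}

    # Walk the old vocabulary in id order so the relative order is preserved.
    kept = [
        tok
        for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])
        if tok in seen or tok in always_keep
    ]

    # Assign new contiguous ids and copy the corresponding embedding rows.
    new_vocab = {tok: new_id for new_id, tok in enumerate(kept)}
    new_embeddings = [embeddings[vocab[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Hypothetical multilingual vocabulary with one embedding row per token.
vocab = {
    "<pad>": 0, "<unk>": 1, "▁ahoj": 2, "▁hello": 3,
    "▁svet": 4, "▁world": 5, "▁bonjour": 6,
}
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Hypothetical tokenized Slovak corpus.
slovak_corpus = [["▁ahoj", "▁svet"]]

new_vocab, new_embeddings = trim_vocabulary(vocab, embeddings, slovak_corpus)
reduction = 1 - len(new_vocab) / len(vocab)
print(new_vocab)
print(f"vocabulary reduced by {reduction:.0%}")
```

In a multilingual encoder the token embedding table typically accounts for a large share of the parameters, which is why dropping unused rows can shrink a model substantially while leaving the transformer layers untouched. A real setup would also need to update the tokenizer to match the new ids.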

0 Citations
0 Influential
6 Altmetric
30.0 Score