Abstract
Text embeddings are an essential building block of several NLP tasks, such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB -- a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
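The abstract names cross-lingual contrastive distillation as the adaptation recipe but does not spell it out here. Below is a minimal sketch of one common formulation, assuming an InfoNCE-style objective over parallel sentence pairs in which a frozen teacher (e.g., mE5) embeds the high-resource side and the student is pulled toward its matching teacher embedding; the function and parameter names are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb: torch.Tensor,
                                  teacher_emb: torch.Tensor,
                                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive distillation (illustrative sketch).

    student_emb: (batch, dim) student embeddings of African-language sentences.
    teacher_emb: (batch, dim) frozen-teacher embeddings of the parallel
                 (e.g., English) sentences; row i of each tensor is a pair.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()  # teacher provides targets only
    logits = s @ t.T / temperature                 # (batch, batch) similarities
    # The diagonal holds the true parallel pairs; all other rows act as
    # in-batch negatives, so the target label for row i is simply i.
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```

Under this (assumed) setup, the student starts from the teacher's weights and is fine-tuned so that its African-language embeddings land near the teacher's embeddings of the translations, transferring the teacher's embedding space to the lower-resource languages.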
Authors
Kosei Uemura
Miaoran Zhang
David Ifeoluwa Adelani
Submitted
October 27, 2025
Key Contributions
This paper introduces AfriMTEB, a comprehensive text-embedding benchmark for African languages that addresses their underrepresentation in existing multilingual benchmarks. It also presents AfriE5, a text embedding model adapted to these languages via cross-lingual contrastive distillation, which achieves state-of-the-art performance.
Business Value
Enables the development of NLP applications and services that are more inclusive and effective for speakers of African languages, opening up new markets and improving accessibility.