Abstract
Text embeddings are an essential building block of several NLP tasks, such as retrieval-augmented generation, which is crucial for preventing hallucinations in LLMs. Despite the recent release of the massively multilingual MTEB (MMTEB), African languages remain underrepresented, with existing tasks often repurposed from translation benchmarks such as FLORES clustering or SIB-200. In this paper, we introduce AfriMTEB -- a regional expansion of MMTEB covering 59 languages, 14 tasks, and 38 datasets, including six newly added datasets. Unlike many MMTEB datasets that include fewer than five languages, the new additions span 14 to 56 African languages and introduce entirely new tasks, such as hate speech detection, intent detection, and emotion classification, which were not previously covered. Complementing this, we present AfriE5, an adaptation of the instruction-tuned mE5 model to African languages through cross-lingual contrastive distillation. Our evaluation shows that AfriE5 achieves state-of-the-art performance, outperforming strong baselines such as Gemini-Embeddings and mE5.
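The abstract names cross-lingual contrastive distillation as the adaptation recipe but does not spell it out here. Below is a minimal sketch of one common formulation, assuming an InfoNCE-style objective over parallel sentence pairs in which a frozen teacher (e.g., mE5) embeds the high-resource side and the student is pulled toward its matching teacher embedding; the function and parameter names are illustrative, not the paper's.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb: torch.Tensor,
                                  teacher_emb: torch.Tensor,
                                  temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style contrastive distillation (illustrative sketch).

    student_emb: (batch, dim) student embeddings of African-language sentences.
    teacher_emb: (batch, dim) frozen-teacher embeddings of the parallel
                 (e.g., English) sentences; row i of each tensor is a pair.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1).detach()  # teacher provides targets only
    logits = s @ t.T / temperature                 # (batch, batch) similarities
    # The diagonal holds the true parallel pairs; all other rows act as
    # in-batch negatives, so the target label for row i is simply i.
    labels = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, labels)
```

Under this (assumed) setup, the student starts from the teacher's weights and is fine-tuned so that its African-language embeddings land near the teacher's embeddings of the translations, transferring the teacher's embedding space to the lower-resource languages.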
Authors
Kosei Uemura
Miaoran Zhang
David Ifeoluwa Adelani
Submitted
October 27, 2025
Key Contributions
This paper introduces AfriMTEB, a comprehensive text-embedding benchmark for African languages that addresses their underrepresentation in existing multilingual benchmarks. It also presents AfriE5, a text embedding model adapted to these languages via cross-lingual contrastive distillation, which achieves state-of-the-art performance.
Business Value
Enables the development of NLP applications and services that are more inclusive and effective for speakers of African languages, opening up new markets and improving accessibility.