Abstract
Multilingual vision-language models (VLMs) promise universal image-text
retrieval, yet their social biases remain underexplored. We perform the first
systematic audit of four public multilingual CLIP variants: M-CLIP, NLLB-CLIP,
CAPIVARA-CLIP, and the debiased SigLIP-2, covering ten languages that differ in
resource availability and morphological gender marking. Using balanced subsets
of FairFace and the PATA stereotype suite in a zero-shot setting, we quantify
race and gender bias and measure stereotype amplification. Contrary to the
intuition that multilinguality mitigates bias, every model exhibits stronger
gender skew than its English-only baseline. CAPIVARA-CLIP shows its largest
biases precisely in the low-resource languages it targets, while the shared
encoder of NLLB-CLIP and SigLIP-2 transfers English gender stereotypes into
gender-neutral languages; loosely coupled encoders largely avoid this leakage.
Although SigLIP-2 reduces agency and communion skews, it inherits -- and in
caption-sparse contexts (e.g., Xhosa) amplifies -- the English anchor's crime
associations. Highly gendered languages consistently magnify all bias types,
yet gender-neutral languages remain vulnerable whenever cross-lingual weight
sharing imports foreign stereotypes. Aggregated metrics thus mask
language-specific hot spots, underscoring the need for fine-grained,
language-aware bias evaluation in future multilingual VLM research.
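The abstract describes the audit only at a high level (zero-shot probing of balanced FairFace subsets and PATA prompts, then comparing association rates across demographic groups and languages). As a rough illustration of that general recipe, and not the paper's exact metric, the sketch below assumes precomputed, L2-normalized CLIP image embeddings grouped by demographic attribute and text embeddings for a stereotype prompt versus a neutral prompt in a given language; the function name `zero_shot_skew`, the embedding dimension, and the random toy data are all illustrative assumptions.

```python
import numpy as np

def zero_shot_skew(image_embs_by_group, text_embs):
    """Toy zero-shot association skew.

    image_embs_by_group: dict mapping a demographic group label
        (e.g. "female", "male") to an (N_g, D) array of unit-norm
        image embeddings from a multilingual CLIP image encoder.
    text_embs: (C, D) array of unit-norm text embeddings, where
        prompt 0 is the stereotype caption (e.g. a crime-related
        caption in the target language) and prompt 1 is neutral.

    Returns per-group top-1 assignment rates for prompt 0 and the
    max pairwise gap between groups as a simple skew score.
    """
    rates = {}
    for group, embs in image_embs_by_group.items():
        sims = embs @ text_embs.T      # cosine similarity (unit vectors)
        top1 = sims.argmax(axis=1)     # zero-shot label per image
        rates[group] = float((top1 == 0).mean())
    vals = list(rates.values())
    return rates, max(vals) - min(vals)

# Toy usage with random vectors standing in for real CLIP embeddings.
rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
groups = {g: unit(rng.normal(size=(100, 512))) for g in ("female", "male")}
prompts = unit(rng.normal(size=(2, 512)))  # [stereotype caption, neutral caption]
rates, skew = zero_shot_skew(groups, prompts)
print(rates, skew)
```

Running the same computation per language and per model variant is what would surface the language-specific hot spots the abstract argues aggregated metrics hide; the paper itself should be consulted for the exact bias and amplification definitions used.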
Authors (3)
Zahraa Al Sahili
Ioannis Patras
Matthew Purver
Key Contributions
Conducts the first systematic audit of four multilingual CLIP variants (M-CLIP, NLLB-CLIP, CAPIVARA-CLIP, SigLIP-2) across ten languages to quantify race and gender bias. Finds that multilinguality often strengthens gender bias, and bias can transfer from high-resource to low-resource languages, contrary to expectations.
Business Value
Highlights critical ethical considerations for developing and deploying global AI systems, promoting fairness and mitigating harmful biases in multimodal applications.