
Explaining and Mitigating Crosslingual Tokenizer Inequities

📄 Abstract

The number of tokens it takes to encode parallel text in different languages is known to vary. These disparities are called token premiums. High token premiums reduce throughput during training and increase costs at inference. In this paper, we show that even after controlling for dataset size, vocabulary size, and data content, monolingual tokenizers exhibit a wide range of token premiums across languages. To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for relationships with factors such as data similarity (between tokenizer training and evaluation data), vocabulary size, and pre-tokenization. We also investigate the role of language-specific features such as writing system and word length. We find that similarity between training and test data does not impact token premiums, but vocabulary size and pre-tokenization do. While simply increasing vocabulary size does not reduce token premium effects, we can determine an "optimal" vocabulary size for each language that significantly reduces them. We also train superword tokenizers, which allow merges over whitespace, and find that they both reduce token premium effects and improve compression overall. Thus, intervening on the vocabulary size or the pre-tokenizer significantly reduces crosslingual token premium effects.
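
As a rough illustration of how a token premium can be measured, the sketch below counts tokens over a parallel corpus and normalizes against a reference language. The `tokenize(text, lang)` callable and the `parallel_corpus` mapping are hypothetical placeholders, not artifacts from the paper, and the paper's own measurement pipeline may differ.

```python
# Minimal sketch: measuring token premiums over parallel text.
# `tokenize(text, lang)` and `parallel_corpus` are hypothetical placeholders;
# the paper's actual measurement pipeline may differ.

def token_premiums(parallel_corpus, tokenize, reference_lang="eng"):
    """Return tokens-per-language normalized to a reference language.

    parallel_corpus: dict mapping language code -> list of sentences that are
        translations of one another across languages.
    tokenize: callable (text, lang) -> list of tokens produced by that
        language's monolingual tokenizer.
    """
    totals = {
        lang: sum(len(tokenize(sent, lang)) for sent in sents)
        for lang, sents in parallel_corpus.items()
    }
    reference = totals[reference_lang]
    # A value above 1.0 means the language needs more tokens than the
    # reference language to encode the same content.
    return {lang: total / reference for lang, total in totals.items()}
```
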
Authors (4)
Catherine Arnett
Tyler A. Chang
Stella Biderman
Benjamin K. Bergen
Submitted
October 24, 2025
arXiv Category
cs.CL
arXiv PDF

Key Contributions

This paper explains and mitigates crosslingual tokenizer inequities by measuring token premiums across 97 languages with a suite of roughly 7,000 comparable monolingual tokenizers. Controlling for dataset size, vocabulary size, and data content, it finds that similarity between tokenizer training and evaluation data does not affect token premiums, while vocabulary size and pre-tokenization do. Choosing a language-specific optimal vocabulary size or training superword tokenizers that can merge across whitespace significantly reduces token premiums, with superword tokenizers also improving compression overall.
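
One way the superword intervention could look in practice is sketched below with the Hugging Face `tokenizers` library: the only difference from a conventional BPE setup is omitting the whitespace pre-tokenizer so that merges may span spaces. The vocabulary size and corpus file are placeholders, and this is an assumed setup for illustration, not the authors' exact training code.

```python
# Hedged sketch (not the authors' exact setup): conventional vs. superword-style
# BPE training with the Hugging Face `tokenizers` library. "corpus.txt" and the
# vocabulary size below are placeholders.
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def train_bpe(corpus_files, vocab_size=32000, superword=False):
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    if not superword:
        # Conventional setup: pre-tokenize on whitespace, so no merge can
        # cross a word boundary.
        tokenizer.pre_tokenizer = Whitespace()
    # Superword setup: leave the pre-tokenizer unset, so BPE merges are free
    # to span whitespace (training is slower, since whole lines are treated
    # as single pre-tokenization units).
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train(corpus_files, trainer=trainer)
    return tokenizer

# Example usage (placeholder file):
#   standard = train_bpe(["corpus.txt"])
#   superword = train_bpe(["corpus.txt"], superword=True)
```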

Business Value

More equitable tokenization lowers training and inference costs and improves throughput for non-English languages, helping make multilingual NLP systems more affordable and accessible in global markets.