Abstract
Language models (LMs) are bound to their tokenizer, which maps raw text to a
sequence of vocabulary items (tokens). This restricts their flexibility: for
example, LMs trained primarily on English may still perform well in other
natural and programming languages, but have vastly decreased efficiency due to
their English-centric tokenizer. To mitigate this, we should be able to swap
the original LM tokenizer with an arbitrary one, on the fly, without degrading
performance. Hence, in this work we define a new problem: Zero-Shot Tokenizer
Transfer (ZeTT). The challenge at the core of ZeTT is finding embeddings for
the tokens in the vocabulary of the new tokenizer. Since prior heuristics for
initializing embeddings often perform at chance level in a ZeTT setting, we
propose a new solution: we train a hypernetwork taking a tokenizer as input and
predicting the corresponding embeddings. We empirically demonstrate that the
hypernetwork generalizes to new tokenizers both with encoder (e.g., XLM-R) and
decoder LLMs (e.g., Mistral-7B). Our method comes close to the original models'
performance in cross-lingual and coding tasks while markedly reducing the
length of the tokenized sequence. We also find that the remaining gap can be
quickly closed by continued training on less than 1B tokens. Finally, we show
that a ZeTT hypernetwork trained for a base (L)LM can also be applied to
fine-tuned variants without extra training. Overall, our results make
substantial strides toward detaching LMs from their tokenizer.
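The abstract describes the hypernetwork only at a high level. As a rough, hedged illustration of the idea (not the authors' exact architecture), the PyTorch sketch below assumes each token of the new tokenizer is first decomposed into pieces of the original tokenizer; a small transformer then pools the original piece embeddings into one predicted embedding per new token. The class name `TokenizerHypernetwork`, the mean-pooling choice, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenizerHypernetwork(nn.Module):
    """Sketch: predict an embedding for each token of a *new* tokenizer by
    decomposing it with the *original* tokenizer and pooling over the pieces."""

    def __init__(self, orig_embeddings: torch.Tensor, hidden_dim: int = 768, n_layers: int = 3):
        super().__init__()
        d = orig_embeddings.size(1)
        # Frozen embedding table of the original tokenizer's vocabulary.
        self.orig_embeddings = nn.Embedding.from_pretrained(orig_embeddings, freeze=True)
        self.proj_in = nn.Linear(d, hidden_dim)
        layer = nn.TransformerEncoderLayer(hidden_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.proj_out = nn.Linear(hidden_dim, d)

    def forward(self, piece_ids: torch.Tensor, piece_mask: torch.Tensor) -> torch.Tensor:
        # piece_ids:  (num_new_tokens, max_pieces) ids of each new token's
        #             decomposition under the original tokenizer
        # piece_mask: (num_new_tokens, max_pieces) 1.0 for real pieces, 0.0 for padding
        x = self.proj_in(self.orig_embeddings(piece_ids))
        x = self.encoder(x, src_key_padding_mask=(piece_mask == 0))
        # Mean-pool over valid pieces to obtain one vector per new token.
        denom = piece_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
        pooled = (x * piece_mask.unsqueeze(-1)).sum(dim=1) / denom
        return self.proj_out(pooled)  # (num_new_tokens, d) predicted embeddings
```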
Authors (3)
Benjamin Minixhofer
Edoardo Maria Ponti
Ivan Vulić
Key Contributions
This paper defines and addresses the problem of Zero-Shot Tokenizer Transfer (ZeTT), enabling language models to swap their original tokenizer with an arbitrary one without performance degradation. It proposes a novel solution using a hypernetwork that takes a tokenizer as input and predicts the corresponding embeddings. This approach significantly improves efficiency for LMs operating on languages or codebases different from their training data, overcoming the limitations of previous heuristics.
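Once embeddings for the new vocabulary have been predicted, the transfer itself amounts to swapping the model's embedding matrices. The snippet below is a hypothetical usage sketch using the Hugging Face `transformers` API; the target tokenizer path is a placeholder, the random tensor stands in for the hypernetwork's predictions, and tied vs. untied output embeddings would need to be handled according to the actual model.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and an arbitrary target tokenizer (path is a placeholder).
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
new_tokenizer = AutoTokenizer.from_pretrained("path/to/target-tokenizer")

# Stand-in for the hypernetwork's output: in practice this tensor would be
# produced by running the trained hypernetwork over the new vocabulary.
new_embeddings = torch.randn(len(new_tokenizer), model.config.hidden_size)

# Swap in the predicted embeddings; zero-shot transfer needs no gradient updates.
model.resize_token_embeddings(len(new_tokenizer))
with torch.no_grad():
    model.get_input_embeddings().weight.copy_(new_embeddings)
    if model.get_output_embeddings() is not None and not model.config.tie_word_embeddings:
        model.get_output_embeddings().weight.copy_(new_embeddings)
```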
Business Value
Allows for more flexible and efficient deployment of large language models across diverse languages and programming languages without costly retraining, significantly reducing operational costs and expanding the applicability of LLMs.