
Modality Matching Matters: Calibrating Language Distances for Cross-Lingual Transfer in URIEL+


📄 Abstract

Existing linguistic knowledge bases such as URIEL+ provide valuable geographic, genetic and typological distances for cross-lingual transfer but suffer from two key limitations. One, their one-size-fits-all vector representations are ill-suited to the diverse structures of linguistic data, and two, they lack a principled method for aggregating these signals into a single, comprehensive score. In this paper, we address these gaps by introducing a framework for type-matched language distances. We propose novel, structure-aware representations for each distance type: speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology. We unify these signals into a robust, task-agnostic composite distance. In selecting transfer languages, our representations and composite distances consistently improve performance across a wide range of NLP tasks, providing a more principled and effective toolkit for multilingual research.
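
To give a feel for why hyperbolic embeddings suit tree-like genealogical data, the sketch below computes the geodesic distance between two points in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model or training procedure the paper uses, so the choice of the Poincaré ball, the function name, and the example coordinates are all assumptions made for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical 2-D coordinates: a point near the origin behaves like a node
# close to the root of a tree, points near the boundary behave like leaves.
root_like = [0.0, 0.0]
leaf_like_a = [0.9, 0.0]
leaf_like_b = [0.0, 0.9]

print(poincare_distance(root_like, leaf_like_a))
print(poincare_distance(leaf_like_a, leaf_like_b))
```

Distances grow rapidly toward the boundary of the ball, so two leaf-like points end up much farther from each other than either is from the root-like point, which mirrors path lengths in a family tree.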

Authors (8)

York Hay Ng

Aditya Khan

Xiang Lu

Matteo Salloum

Michael Zhou

Phuong H. Hoang

+2 more

Submitted

October 22, 2025

arXiv Category

cs.CL


Key Contributions

This paper proposes a framework for type-matched language distances to improve cross-lingual transfer, addressing limitations of existing knowledge bases like URIEL+. It introduces novel, structure-aware representations for geography, genealogy, and typology, unifying them into a composite distance that consistently improves performance across various NLP tasks.
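
The summary does not describe how the three signals are combined, so the following is only a placeholder for the idea of a composite distance: a weighted mean of per-type distances that are assumed to be already scaled to [0, 1]. The function name, the weighting scheme, and the numeric values are hypothetical and should not be read as the paper's aggregation method.

```python
def composite_distance(distances, weights=None):
    """Weighted mean of per-type distances (illustrative aggregation only).

    distances: dict mapping a distance type to a value assumed to lie in [0, 1].
    weights:   optional dict with one non-negative weight per distance type;
               defaults to equal weights.
    """
    if not distances:
        raise ValueError("at least one distance is required")
    if weights is None:
        weights = {name: 1.0 for name in distances}
    total = sum(weights[name] for name in distances)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * value for name, value in distances.items()) / total


# Made-up distances between one pair of languages.
pair = {"geographic": 0.2, "genetic": 0.4, "typological": 0.6}

print(composite_distance(pair))
print(composite_distance(pair, {"geographic": 1.0, "genetic": 1.0, "typological": 2.0}))
```

A fixed weighted mean is the simplest possible aggregator; a task-agnostic composite as described in the paper would presumably need a more careful treatment of scale and redundancy between the signals.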

Business Value

Enables more effective development of multilingual NLP applications, reducing the cost and effort required for cross-lingual adaptation and improving performance.

Paper Metadata

Innovation Type

Framework/Methodology

Deployment Feasibility

High. The framework provides a more principled way to select transfer languages, which can be integrated into existing NLP pipelines.

Limitations Addressed

One-size-fits-all vector representations in linguistic knowledge bases and the lack of a principled method for aggregating diverse linguistic signals.

Performance Gains

Consistent improvements across a wide range of NLP tasks when the proposed representations and composite distances are used to select transfer languages.

Technical Tags

cross-lingual transfer, linguistic distance, language representations, URIEL+, typology, genealogy, geography, hyperbolic embeddings, latent variables model, NLP tasks

Research Topics

Computational Linguistics, Cross-Lingual NLP, Machine Learning, Linguistic Typology, Knowledge Representation

Methods & Architectures

Type-matched language distance representations, Structure-aware representations, Hyperbolic embeddings, Latent variable models, Composite distance calculation, LLMs (general), URIEL+

Applications & Tasks

Linguistics; Natural Language Processing; Computational Social Science
Limitations of existing linguistic knowledge bases; Ill-suited vector representations; Lack of principled aggregation of linguistic signals
Improving cross-lingual transfer performance; Developing type-matched language distance representations; Creating a robust, task-agnostic composite distance score

Datasets & Benchmarks

Benchmarks

Performance improvement across a wide range of NLP tasks

Performance on NLP tasks (e.g., translation, classification); Accuracy of language distance predictions

Related Fields

Computational Linguistics, Natural Language Processing, Machine Learning, Linguistics, Data Mining

Keywords

Cross-Lingual Transfer, Linguistic Distance, Language Representations, URIEL+, Typology, Genealogy, Geography, Hyperbolic Embeddings, Latent Variables, Multilingual NLP, NLP Tasks, Language Models

Academic Context

#Computational Linguistics, #Cross-Lingual NLP, #Machine Learning, #Linguistic Typology, #Knowledge Representation

Technology Stack

Frameworks & Libraries

URIEL+

Commercial Potential

Potential Products

Tools for selecting optimal source languages for cross-lingual transfer; Improved multilingual NLP models; Linguistic analysis platforms

Target Industries

Technology, Translation Services, Global Communications, Research

Use Case Examples

Selecting the best source language to train a model for a low-resource target language. Improving machine translation quality by leveraging typological similarities.
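
As a rough illustration of the first use case, transfer-language selection can be framed as ranking candidate source languages by their distance to the target and keeping the closest ones. The sketch below assumes some distance function is already available; the helper names, the placeholder language codes, and the distance values are hypothetical and are not taken from the paper or from URIEL+.

```python
def rank_transfer_languages(target, candidates, distance_fn, top_k=3):
    """Return up to top_k candidate languages, closest to the target first."""
    scored = [(distance_fn(target, lang), lang) for lang in candidates if lang != target]
    scored.sort()  # ascending distance; ties are broken by language code
    return [lang for _, lang in scored[:top_k]]


# Made-up composite distances from three candidate sources to a target "tgt".
TOY_DISTANCES = {"src_a": 0.62, "src_b": 0.18, "src_c": 0.41}


def toy_distance(target, source):
    """Stand-in for a real distance lookup; ignores the target in this toy table."""
    return TOY_DISTANCES[source]


best = rank_transfer_languages("tgt", list(TOY_DISTANCES), toy_distance, top_k=2)
print(best)  # ['src_b', 'src_c']
```

In practice the distance function would be backed by a knowledge base lookup, and the ranking would typically be validated against downstream task performance rather than trusted on its own.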

Competitive Edge

Offers a more principled and effective method for calculating language distances compared to existing approaches, leading to better cross-lingual transfer performance.

Market Opportunity

Large market for multilingual NLP solutions.

Revenue Models

Licensing the methodology; offering services for language selection.

Resource Requirements

Compute Needs

Moderate for training embeddings and calculating distances.

Data Requirements

Linguistic knowledge bases (like URIEL+), data for training embeddings (e.g., language corpora).

Deployment Constraints

Requires accurate and comprehensive linguistic data for all languages of interest.

Scalability

Scalable for calculating distances between language pairs.

Regulatory Considerations

N/A (focus on linguistic modeling).

Production Readiness

Maturity Level

Research Framework

Time to Market

1-2 years for integration into NLP toolkits.

Licensing

Likely open-source.

Patent Potential

Moderate, particularly around the novel representations and composite distance calculation.
