📄 Abstract
Existing linguistic knowledge bases such as URIEL+ provide valuable
geographic, genetic and typological distances for cross-lingual transfer but
suffer from two key limitations. One, their one-size-fits-all vector
representations are ill-suited to the diverse structures of linguistic data,
and two, they lack a principled method for aggregating these signals into a
single, comprehensive score. In this paper, we address these gaps by
introducing a framework for type-matched language distances. We propose novel,
structure-aware representations for each distance type: speaker-weighted
distributions for geography, hyperbolic embeddings for genealogy, and a latent
variables model for typology. We unify these signals into a robust,
task-agnostic composite distance. In selecting transfer languages, our
representations and composite distances consistently improve performance across
a wide range of NLP tasks, providing a more principled and effective toolkit
for multilingual research.
Authors (8)
York Hay Ng
Aditya Khan
Xiang Lu
Matteo Salloum
Michael Zhou
Phuong H. Hoang
+2 more
Submitted
October 22, 2025
Key Contributions
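This paper proposes a framework for type-matched language distances to improve cross-lingual transfer, addressing two limitations of existing knowledge bases such as URIEL+: one-size-fits-all vector representations, and the lack of a principled way to aggregate distance signals into a single score. It introduces a structure-aware representation for each distance type (speaker-weighted distributions for geography, hyperbolic embeddings for genealogy, and a latent variables model for typology) and unifies them into a task-agnostic composite distance. When used to select transfer languages, these representations and the composite distance consistently improve performance across a wide range of NLP tasks.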
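To make the idea of a hyperbolic representation for genealogy more concrete, here is a minimal sketch of the standard Poincaré-ball distance, a common choice for embedding tree-like data. This summary does not say which hyperbolic model or embedding procedure the paper uses, so the function name `poincare_distance` and the 2-D coordinates below are assumptions made purely for illustration.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)


# Hypothetical coordinates (not taken from the paper): a family root placed
# at the origin and two descendant languages placed near the boundary.
root = (0.0, 0.0)
leaf_a = (0.8, 0.0)
leaf_b = (0.0, 0.8)

d_root_a = poincare_distance(root, leaf_a)
d_a_b = poincare_distance(leaf_a, leaf_b)
```

In this toy layout the two leaves are farther from each other than either is from the root, which is the tree-like behavior that makes hyperbolic space a natural fit for genealogical hierarchies.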
Business Value
Enables more effective development of multilingual NLP applications, reducing the cost and effort required for cross-lingual adaptation and improving performance.