Abstract
Most of the world's languages and dialects are low-resource and lack support
in mainstream machine translation (MT) models. However, many of them have a
closely-related high-resource language (HRL) neighbor and differ from it in
linguistically regular ways. This underscores the importance of model
robustness to dialectal variation and cross-lingual generalization to the HRL
dialect continuum. We present DialUp, consisting of a training-time technique
for adapting a pretrained model to dialectal data (M->D), and an inference-time
intervention for adapting dialectal data to the model's expertise (D->M). M->D
induces model robustness to potentially unseen and unknown dialects by exposure
to synthetic data exemplifying linguistic mechanisms of dialectal variation,
whereas D->M treats dialectal divergence for known target dialects. These
methods show considerable performance gains for several dialects from four
language families, and modest gains for two other language families. We also
conduct feature and error analyses, which show that language varieties with low
baseline MT performance are more likely to benefit from these approaches.
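The M->D direction hinges on generating synthetic data that mimics how dialects diverge from their HRL neighbor. The sketch below is a minimal illustration of that idea, not the paper's actual pipeline: the seed lexicon, sound-change rules, and sampling probabilities are all toy assumptions.

```python
import random

# Toy stand-ins for "linguistic mechanisms of dialectal variation":
# lexical substitutions from a hypothetical HRL->dialect seed lexicon,
# and regular sound/spelling correspondences. Both are assumptions for
# illustration, not resources from the paper.
SEED_LEXICON = {"house": "haus", "water": "watta"}
SOUND_RULES = [("th", "d"), ("er", "a")]

def perturb_sentence(sentence: str, lex_prob: float = 0.5, rule_prob: float = 0.5) -> str:
    """Produce one synthetic dialect-like variant of an HRL sentence."""
    tokens = sentence.lower().split()
    # Lexical mechanism: swap each seed-lexicon word with probability lex_prob.
    tokens = [
        SEED_LEXICON[t] if t in SEED_LEXICON and random.random() < lex_prob else t
        for t in tokens
    ]
    out = " ".join(tokens)
    # Phonological mechanism: apply each regular correspondence with probability rule_prob.
    for src, tgt in SOUND_RULES:
        if random.random() < rule_prob:
            out = out.replace(src, tgt)
    return out

# Fine-tuning data: noise only the source side of an HRL->X parallel corpus
# (here, English as the HRL and French as the target), so the pretrained model
# learns to translate dialect-like input it has never seen.
parallel = [("the water is near the house", "l'eau est près de la maison")]
augmented = parallel + [(perturb_sentence(src), tgt) for src, tgt in parallel]
print(augmented)
```

Training on such augmented pairs is what exposes the model to a continuum of plausible dialectal variants rather than to any single attested dialect.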
Authors (7)
Niyati Bafna
Emily Chang
Nathaniel R. Robinson
David R. Mortensen
Kenton Murray
David Yarowsky
+1 more
Submitted
January 27, 2025
Key Contributions
Introduces DialUp, which pairs two complementary techniques: M->D, a training-time method that uses synthetic data to make a pretrained model robust to unseen dialects, and D->M, an inference-time intervention that adapts input from known dialects to the model's expertise, yielding considerable MT gains across several dialect continua.
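In the D->M direction, the intervention happens at inference time: dialectal input is rewritten toward forms the pretrained model already knows before translation. A minimal sketch, assuming a small curated dialect->HRL lexicon for a known target dialect (the mapping below is a toy assumption, not the paper's actual resource):

```python
# Hypothetical dialect -> HRL word mapping for one *known* dialect.
DIALECT_TO_HRL = {"da": "the", "watta": "water", "haus": "house"}

def normalize_to_hrl(dialect_sentence: str) -> str:
    """Rewrite known dialectal forms into HRL equivalents so the unmodified
    pretrained MT model sees input closer to its training distribution."""
    return " ".join(DIALECT_TO_HRL.get(tok, tok) for tok in dialect_sentence.lower().split())

# The normalized sentence, not the raw dialectal one, is fed to the MT model.
print(normalize_to_hrl("da watta is near da haus"))  # -> "the water is near the house"
```

Because this step needs dialect-specific resources, D->M targets known dialects, whereas M->D aims at robustness to unseen ones.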
Business Value
Expands the reach of NLP technologies to a wider range of languages and dialects, fostering global communication and access to information for underserved linguistic communities.