📄 Abstract
The rapid development of multilingual large language models (LLMs) highlights
the need for high-quality, diverse, and well-curated multilingual datasets. In
this paper, we introduce DCAD-2000 (Data Cleaning as Anomaly Detection), a
large-scale multilingual corpus constructed from newly extracted Common Crawl
data and existing multilingual sources. DCAD-2000 covers 2,282 languages,
46.72TB of text, and 8.63 billion documents, spanning 155 high- and
medium-resource languages and 159 writing scripts. To overcome the limitations
of existing data cleaning approaches, which rely on manually designed heuristic
thresholds, we reframe data cleaning as an anomaly detection problem. This
dynamic filtering paradigm substantially improves data quality by automatically
identifying and removing noisy or anomalous content. By fine-tuning LLMs on
DCAD-2000, we demonstrate notable improvements in data quality, robustness of
the cleaning pipeline, and downstream performance, particularly for
low-resource languages across multiple multilingual benchmarks.
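
To make the contrast with fixed heuristic thresholds concrete, the sketch below flags documents whose quality features are outliers relative to the rest of the same collection. It is only an illustration: the abstract does not say which features or which anomaly detector DCAD-2000 uses, so the three features, the median/MAD outlier rule, the 3.5 cutoff, and the names `features`, `robust_z`, and `flag_anomalies` are all assumptions made for this example.

```python
from statistics import median


def features(doc: str) -> list[float]:
    """Hypothetical per-document quality features (not taken from the paper)."""
    words = doc.split()
    n_chars = max(len(doc), 1)
    # Share of characters that are neither letters nor whitespace.
    non_alpha = sum(1 for ch in doc if not (ch.isalpha() or ch.isspace()))
    # Share of distinct words; low values suggest repetitive text.
    unique_ratio = len(set(words)) / max(len(words), 1)
    return [float(len(words)), non_alpha / n_chars, unique_ratio]


def robust_z(values: list[float]) -> list[float]:
    """Robust z-scores based on the median and median absolute deviation."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes MAD comparable to a standard deviation
    if scale == 0:
        # No measurable spread for this feature: treat nothing as anomalous.
        return [0.0 for _ in values]
    return [(v - med) / scale for v in values]


def flag_anomalies(docs: list[str], threshold: float = 3.5) -> list[int]:
    """Return indices of documents that are outliers on any feature.

    The cutoff applies to scores computed from the collection itself, so the
    effective limit on each raw feature adapts to the data instead of being
    hand-tuned per language.
    """
    rows = [features(d) for d in docs]
    flagged: set[int] = set()
    for col in range(len(rows[0])):
        scores = robust_z([row[col] for row in rows])
        flagged.update(i for i, z in enumerate(scores) if abs(z) > threshold)
    return sorted(flagged)
```

Calling `flag_anomalies(docs)` on a list of strings returns the positions of the documents to drop. Because the scores are relative to each collection, the same call can be run separately per language without choosing new limits by hand, which is the property the abstract attributes to the anomaly-detection framing.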
Authors (7)
Yingli Shen
Wen Lai
Shuo Wang
Xueren Zhang
Kangyang Luo
Alexander Fraser
+1 more
Submitted
February 17, 2025
Key Contributions
Introduces DCAD-2000, a multilingual corpus (2,282 languages, 46.72TB, 8.63 billion documents) built with a "Data Cleaning as Anomaly Detection" approach. Instead of relying on manually designed heuristic thresholds, the method filters noisy or anomalous content dynamically, improving data quality and yielding downstream performance gains for LLMs, particularly for low-resource languages.
Business Value
Provides a large, cleaned multilingual corpus for building more capable and equitable multilingual AI systems, supporting research and application development for diverse linguistic communities, including speakers of low-resource languages.