Abstract
Under-resourced languages, such as the Quechua languages, face data and resource scarcity,
which hinders the development of speech technology for them. To address this issue, Common
Voice presents a crucial opportunity to foster open, community-driven
speech dataset creation. This paper examines the integration of Quechua
languages into Common Voice. We detail the 17 Quechua languages currently included,
presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes
language onboarding and corpus collection of both read and spontaneous
speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of
Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77%
validated), highlighting Common Voice's potential. We further propose a
research agenda addressing technical challenges, alongside ethical
considerations for community engagement and indigenous data sovereignty. Our
work contributes towards inclusive voice technology and digital empowerment of
under-resourced language communities.
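To put the reported figures in absolute terms, the following minimal sketch converts the percentages from the abstract into approximate validated hours. It assumes the validated percentages apply to the total recorded hours; the abstract does not state this explicitly. The function and variable names are illustrative only, and because the inputs are already rounded, the results are estimates rather than figures from the paper.

```python
def validated_hours(total_hours: float, validated_pct: float) -> float:
    """Approximate validated hours.

    Assumption: the percentage is taken over total recorded hours.
    """
    return total_hours * validated_pct / 100.0


# Figures as quoted in the abstract (already rounded there).
ALL_QUECHUA = {"total_hours": 191.1, "validated_pct": 86}
PUNO_QUECHUA = {"total_hours": 12.0, "validated_pct": 77}

all_validated = validated_hours(**ALL_QUECHUA)
puno_validated = validated_hours(**PUNO_QUECHUA)

# Share of all Quechua hours on Common Voice that come from Puno Quechua.
puno_share_pct = 100.0 * PUNO_QUECHUA["total_hours"] / ALL_QUECHUA["total_hours"]

print(f"All Quechua, validated (approx.): {all_validated:.1f} h")
print(f"Puno Quechua, validated (approx.): {puno_validated:.1f} h")
print(f"Puno Quechua share of total hours: {puno_share_pct:.1f}%")
```

Under that assumption, roughly 164 of the 191.1 hours are validated overall and roughly 9 of the 12 Puno Quechua hours, with Puno Quechua accounting for about 6% of all Quechua hours.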
Authors (4)
Elwin Huaman
Wendi Huaman
Jorge Luis Huaman
Ninfa Quispe
Submitted
October 13, 2025
Key Contributions
This paper addresses the critical data scarcity issue for under-resourced languages like Quechua by detailing the integration of Puno Quechua into the Common Voice dataset. It highlights the potential of community-driven initiatives for speech technology development and proposes a research agenda for technical and ethical challenges, including indigenous data sovereignty.
Business Value
Enabling speech technology for previously underserved linguistic communities can unlock new markets and applications, fostering digital inclusion and economic opportunities for these populations.