Research paper for speech technologists, linguists, AI researchers, community organizers, and ethicists

Quechua Speech Datasets in Common Voice: The Case of Puno Quechua

📄 Abstract

Under-resourced languages, such as the Quechua languages, face data and resource scarcity that hinders their development in speech technology. To address this issue, Common Voice presents a crucial opportunity to foster open, community-driven speech dataset creation. This paper examines the integration of Quechua languages into Common Voice. We detail the current 17 Quechua languages, presenting Puno Quechua (ISO 639-3: qxp) as a focused case study that includes language onboarding and corpus collection of both read and spontaneous speech data. Our results demonstrate that Common Voice now hosts 191.1 hours of Quechua speech (86% validated), with Puno Quechua contributing 12 hours (77% validated), highlighting Common Voice's potential. We further propose a research agenda addressing technical challenges, alongside ethical considerations for community engagement and indigenous data sovereignty. Our work contributes towards inclusive voice technology and digital empowerment of under-resourced language communities.
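To make the reported figures easier to compare, the following is a minimal sketch that converts the abstract's percentages into approximate validated hours. The function and variable names are hypothetical and chosen only for illustration. Because the percentages in the abstract are rounded, the results are estimates, not values reported by the paper.

```python
def validated_hours(total_hours: float, validated_pct: float) -> float:
    """Estimate validated hours from a total and a (rounded) percentage."""
    return total_hours * validated_pct / 100.0


# Figures quoted in the abstract; the names are illustrative assumptions.
ALL_QUECHUA_HOURS = 191.1
ALL_QUECHUA_VALIDATED_PCT = 86
PUNO_HOURS = 12
PUNO_VALIDATED_PCT = 77

all_validated = validated_hours(ALL_QUECHUA_HOURS, ALL_QUECHUA_VALIDATED_PCT)
puno_validated = validated_hours(PUNO_HOURS, PUNO_VALIDATED_PCT)

# Share of all recorded Quechua hours that come from Puno Quechua.
puno_share_pct = 100.0 * PUNO_HOURS / ALL_QUECHUA_HOURS

print(f"All Quechua, validated: ~{all_validated:.1f} h")
print(f"Puno Quechua, validated: ~{puno_validated:.1f} h")
print(f"Puno Quechua share of total: ~{puno_share_pct:.1f}%")
```

Under these assumptions, roughly 164 of the 191.1 Quechua hours and roughly 9 of the 12 Puno Quechua hours would be validated, and Puno Quechua would account for about 6% of all recorded Quechua speech.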

Authors (4)

Elwin Huaman

Wendi Huaman

Jorge Luis Huaman

Ninfa Quispe

Submitted

October 13, 2025

arXiv Category

cs.CL

Key Contributions

This paper addresses the critical data scarcity issue for under-resourced languages like Quechua by detailing the integration of Puno Quechua into the Common Voice dataset. It highlights the potential of community-driven initiatives for speech technology development and proposes a research agenda for technical and ethical challenges, including indigenous data sovereignty.

Business Value

Enabling speech technology for previously underserved linguistic communities can unlock new markets and applications, fostering digital inclusion and economic opportunities for these populations.

Paper Metadata

Innovation Type

Dataset creation and community engagement framework

Deployment Feasibility

High, as it focuses on data collection and community building, which are foundational steps for any speech technology deployment.

Limitations Addressed

Data scarcity for under-resourced languages, Lack of standardized speech datasets

Technical Tags

speech datasets, under-resourced languages, corpus collection, data sovereignty, language onboarding, community-driven data, indigenous data

Research Topics

Speech Technology, Data Scarcity, Linguistic Diversity, Community Engagement, Ethical AI

Methods & Architectures

Corpus collection, Data validation, Language onboarding

Applications & Tasks

Speech Technology, Natural Language Processing, Data scarcity for under-resourced languages, Lack of speech datasets, Speech dataset creation, Speech recognition for Quechua

Datasets & Benchmarks

Datasets

Common Voice, Puno Quechua

Percentage of validated data

Related Fields

Computational Linguistics, Sociolinguistics, Digital Humanities, AI Ethics

Keywords

Quechua, Puno Quechua, Common Voice, under-resourced languages, speech datasets, corpus, data collection, language technology, indigenous languages, data sovereignty, speech recognition, linguistics, AI for good

Academic Context

#Speech Technology, #Data Scarcity, #Linguistic Diversity, #Community Engagement, #Ethical AI

Commercial Potential

Potential Products

Speech recognition systems for Quechua, Language learning tools, Voice assistants

Target Industries

Technology, Education, Media, Government

Use Case Examples

Developing voice-controlled interfaces for Quechua speakers, Preserving and promoting indigenous languages through technology

Competitive Edge

Addresses a gap in speech technology resources for Quechua, which is currently underserved compared to major languages.

Market Opportunity

Growing interest in inclusive AI and under-resourced language technologies.

Revenue Models

Indirect, through enabling downstream applications and services.

Resource Requirements

Compute Needs

Low (for data collection and validation)

Data Requirements

Speech recordings, metadata

Deployment Constraints

Community engagement, ethical approvals, data privacy

Scalability

Scalable through community participation and platform infrastructure.

Regulatory Considerations

Data privacy, indigenous data rights, ethical guidelines

Production Readiness

Maturity Level

Early stage (dataset creation)

Time to Market

N/A (focus on foundational data)

Licensing

Likely open source/creative commons for datasets

Patent Potential

Low (focus on open data)
