
ParlaSpeech 3.0: Richly Annotated Spoken Parliamentary Corpora of Croatian, Czech, Polish, and Serbian


📄 Abstract

ParlaSpeech is a collection of spoken parliamentary corpora currently spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all together 6 thousand hours in size. The corpora were built in an automatic fashion from the ParlaMint transcripts and their corresponding metadata, which were aligned to the speech recordings of each corresponding parliament. In this release of the dataset, each of the corpora is significantly enriched with various automatic annotation layers. The textual modality of all four corpora has been enriched with linguistic annotations and sentiment predictions. Similar to that, their spoken modality has been automatically enriched with occurrences of filled pauses, the most frequent disfluency in typical speech. Two out of the four languages have been additionally enriched with detailed word- and grapheme-level alignments, and the automatic annotation of the position of primary stress in multisyllabic words. With these enrichments, the usefulness of the underlying corpora has been drastically increased for downstream research across multiple disciplines, which we showcase through an analysis of acoustic correlates of sentiment. All the corpora are made available for download in JSONL and TextGrid formats, as well as for search through a concordancer.
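
Since the abstract states that the corpora are distributed as JSONL, a reader might start by streaming records line by line. The following is a minimal sketch only: the field names (`id`, `start`, `end`, `text`) and the sample records are hypothetical placeholders, not the documented ParlaSpeech schema, which should be checked against the released files.

```python
import io
import json


def iter_jsonl(stream):
    """Yield one dict per non-empty line of a JSONL text stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def total_hours(records, start_key="start", end_key="end"):
    """Sum segment durations given in seconds and convert to hours.

    The key names are assumptions made for this sketch.
    """
    seconds = sum(r[end_key] - r[start_key] for r in records)
    return seconds / 3600.0


# Made-up in-memory sample standing in for a downloaded corpus file.
sample = io.StringIO(
    '{"id": "seg1", "start": 0.0, "end": 1800.0, "text": "..."}\n'
    '{"id": "seg2", "start": 1800.0, "end": 5400.0, "text": "..."}\n'
)
records = list(iter_jsonl(sample))
hours = total_hours(records)
```

For a real file, the `io.StringIO` object would be replaced by `open(path, encoding="utf-8")`; streaming line by line avoids loading a multi-gigabyte corpus into memory at once.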

Authors (4)

Nikola Ljubešić

Peter Rupnik

Ivan Porupski

Taja Kuzman Pungeršek

Submitted

November 3, 2025

arXiv Category

cs.CL


Key Contributions

This paper presents ParlaSpeech 3.0, an enriched collection of spoken parliamentary corpora for Croatian, Czech, Polish, and Serbian, totaling about 6,000 hours. The corpora are built automatically by aligning ParlaMint transcripts to parliamentary recordings and carry automatic annotation layers: linguistic annotations and sentiment predictions on the text, filled-pause occurrences on the speech, and, for two of the four languages, word- and grapheme-level alignments and primary stress positions.

Business Value

Provides foundational data for developing improved speech recognition, translation, and analysis tools for Slavic languages, potentially opening new markets for speech technology companies.

Paper Metadata

Innovation Type

Dataset Creation/Enrichment

Deployment Feasibility

The dataset is a research resource, not a deployable system. Its creation is feasible through automated processes.

Limitations Addressed

Limited availability of large-scale, annotated spoken corpora for specific Slavic languages; need for diverse annotation layers (linguistic, acoustic, sentiment) in spoken data

Performance Gains

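No quantitative gains are stated in the abstract. The added annotation layers are described as increasing the usefulness of the corpora for downstream research, which the authors showcase through an analysis of acoustic correlates of sentiment.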

Technical Tags

spoken corpora, parliamentary proceedings, Slavic languages, Croatian, Czech, Polish, Serbian, automatic annotation, linguistic annotations, sentiment prediction, disfluencies, stress prediction, speech alignment

Research Topics

Corpus Linguistics, Speech Processing, Natural Language Processing, Slavic Languages, Linguistic Annotation

Methods & Architectures

Automatic corpus construction; alignment of transcripts to speech; automatic annotation layers (linguistic, sentiment, disfluency, stress, alignment)

Applications & Tasks

Linguistic Research, Speech Technology Development, Computational Linguistics, Political Science Research

Lack of large, annotated spoken corpora for Slavic languages; need for rich linguistic and acoustic annotations; facilitating research in speech and language

Speech recognition research, linguistic analysis of spoken language, sentiment analysis on spoken data, phonetic and prosodic analysis

Datasets & Benchmarks

Datasets

ParlaSpeech 3.0, ParlaMint transcripts

Related Fields

Linguistics, Speech Technology, Natural Language Processing, Computational Linguistics, Slavic Studies, Data Science

Keywords

spoken corpora, parliamentary, Slavic languages, Croatian, Czech, Polish, Serbian, annotation, speech, linguistics, sentiment, disfluency, stress

Academic Context

#Corpus Linguistics, #Speech Processing, #Natural Language Processing, #Slavic Languages, #Linguistic Annotation

Commercial Potential

Potential Products

Advanced speech recognition systems for Slavic languages; tools for analyzing political discourse; language learning software

Target Industries

Technology, Academia, Media, Government, Translation Services

Use Case Examples

Training a speech recognition model for the Croatian parliament. Analyzing sentiment trends in Czech parliamentary debates. Studying disfluency patterns in spoken Polish.
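
As an illustration of the sentiment-trend use case, the sketch below averages a per-segment sentiment score by year. It is an assumption-laden example: the `date` and `sentiment` field names, the numeric score scale, and the toy records are all hypothetical and do not reflect the actual ParlaSpeech annotation format.

```python
from collections import defaultdict
from statistics import mean


def mean_sentiment_by_year(records, date_key="date", score_key="sentiment"):
    """Group records by the year prefix of an ISO date and average the scores.

    Both key names are hypothetical; adapt them to the real schema.
    """
    by_year = defaultdict(list)
    for r in records:
        year = r[date_key][:4]  # "2019-03-01" -> "2019"
        by_year[year].append(r[score_key])
    return {year: mean(scores) for year, scores in sorted(by_year.items())}


# Made-up records for demonstration only.
toy = [
    {"date": "2019-03-01", "sentiment": 2.0},
    {"date": "2019-06-10", "sentiment": 4.0},
    {"date": "2020-01-15", "sentiment": 1.0},
]
trend = mean_sentiment_by_year(toy)
```

Grouping on the year prefix keeps the sketch dependency-free; a real analysis would likely also control for speaker and party metadata from the ParlaMint transcripts.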

Competitive Edge

Offers a large, richly annotated collection of spoken parliamentary data (about 6,000 hours) across four Slavic languages, combining linguistic, sentiment, disfluency, alignment, and stress annotation layers.

Resource Requirements

Compute Needs

Moderate, for processing and annotating large audio files.

Data Requirements

Requires large volumes of spoken parliamentary recordings and corresponding transcripts.

Deployment Constraints

The dataset itself is a resource; deployment constraints apply to systems built using it.

Scalability

The automatic construction and annotation pipeline is designed for scalability to potentially include more languages or data.

Production Readiness

Maturity Level

Established Dataset
