Abstract
ParlaSpeech is a collection of spoken parliamentary corpora currently
spanning four Slavic languages - Croatian, Czech, Polish and Serbian - all
together 6 thousand hours in size. The corpora were built in an automatic
fashion from the ParlaMint transcripts and their corresponding metadata, which
were aligned to the speech recordings of each corresponding parliament. In this
release of the dataset, each of the corpora is significantly enriched with
various automatic annotation layers. The textual modality of all four corpora
has been enriched with linguistic annotations and sentiment predictions.
Similarly, their spoken modality has been automatically enriched with
occurrences of filled pauses, the most frequent disfluency in typical speech.
Two out of the four languages have been additionally enriched with detailed
word- and grapheme-level alignments, and the automatic annotation of the
position of primary stress in multisyllabic words. With these enrichments, the
usefulness of the underlying corpora has been drastically increased for
downstream research across multiple disciplines, which we showcase through an
analysis of acoustic correlates of sentiment. All the corpora are made
available for download in JSONL and TextGrid formats, as well as for search
through a concordancer.
Authors (4)
Nikola Ljubešić
Peter Rupnik
Ivan Porupski
Taja Kuzman Pungeršek
Submitted
November 3, 2025
Key Contributions
This paper presents ParlaSpeech 3.0, a significantly enriched collection of spoken parliamentary corpora for Croatian, Czech, Polish, and Serbian, totaling 6,000 hours. The corpora are built automatically from ParlaMint transcripts aligned to parliamentary recordings. All four are enriched with linguistic annotations, sentiment predictions, and filled-pause occurrences; two of the four additionally carry word- and grapheme-level alignments and primary stress annotations. The authors showcase the enrichments with an analysis of acoustic correlates of sentiment.
Business Value
Provides foundational data for developing improved speech recognition, translation, and analysis tools for Slavic languages, potentially opening new markets for speech technology companies.