Research Paper. Audience: Speech researchers, AI engineers, Audiologists, Individuals who stutter, Developers of assistive technologies

StutterZero and StutterFormer: End-to-End Speech Conversion for Stuttering Transcription and Correction

📄 Abstract

Over 70 million people worldwide experience stuttering, yet most automatic speech systems misinterpret disfluent utterances or fail to transcribe them accurately. Existing methods for stutter correction rely on handcrafted feature extraction or multi-stage automatic speech recognition (ASR) and text-to-speech (TTS) pipelines, which separate transcription from audio reconstruction and often amplify distortions. This work introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models that directly convert stuttered speech into fluent speech while jointly predicting its transcription. StutterZero employs a convolutional-bidirectional LSTM encoder-decoder with attention, whereas StutterFormer integrates a dual-stream Transformer with shared acoustic-linguistic representations. Both architectures are trained on paired stuttered-fluent data synthesized from the SEP-28K and LibriStutter corpora and evaluated on unseen speakers from the FluencyBank dataset. Across all benchmarks, StutterZero achieved a 24% decrease in Word Error Rate (WER) and a 31% improvement in semantic similarity (BERTScore) compared to the leading Whisper-Medium model. StutterFormer achieved better results, with a 28% decrease in WER and a 34% improvement in BERTScore. The results validate the feasibility of direct end-to-end stutter-to-fluent speech conversion, offering new opportunities for inclusive human-computer interaction, speech therapy, and accessibility-oriented AI systems.

Authors (1)

Qianheng Xu

Submitted

October 21, 2025

arXiv Category

eess.AS

Key Contributions

This paper introduces StutterZero and StutterFormer, the first end-to-end waveform-to-waveform models for stuttering correction and transcription. By directly converting stuttered speech to fluent speech while jointly predicting transcription, these models overcome the limitations of multi-stage pipelines, offering a more integrated and potentially higher-fidelity solution for millions affected by stuttering.

Business Value

Developing effective tools for individuals who stutter can significantly improve their communication, social interaction, and professional opportunities. This technology could be integrated into communication apps, virtual assistants, and assistive devices.

Paper Metadata

Innovation Type

Architectural Innovation

Deployment Feasibility

Moderate to High. Requires significant computational resources for training and inference, but the end-to-end nature simplifies deployment compared to multi-stage systems.

Limitations Addressed

Misinterpretation of disfluent utterances by ASR systems, Failure of ASR to transcribe stuttered speech accurately, Limitations of multi-stage ASR-TTS pipelines (separate transcription and reconstruction, amplification of distortions)

Performance Gains

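Per the abstract, compared to Whisper-Medium: StutterZero shows a 24% decrease in WER and a 31% improvement in BERTScore; StutterFormer shows a 28% decrease in WER and a 34% improvement in BERTScore. Improvement over existing multi-stage methods is implied by the unified, end-to-end design.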
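To make the WER figures concrete, here is a minimal sketch of how Word Error Rate and a percentage decrease against a baseline could be computed. It assumes the reported decrease is relative to the baseline WER, which the summary does not state explicitly. The function names `word_error_rate` and `relative_decrease` and the example sentences are hypothetical illustrations, not the paper's evaluation code or data.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def relative_decrease(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical example: a transcript that keeps two repeated words
# versus one with a single substituted word.
reference = "i want to go home"
baseline_wer = word_error_rate(reference, "i i want to to go home")
improved_wer = word_error_rate(reference, "i want to go house")
print(baseline_wer, improved_wer, relative_decrease(baseline_wer, improved_wer))
```

In this toy case the repeated words count as two insertions against five reference words, which is why disfluent speech tends to inflate WER for a system that transcribes repetitions literally.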

Technical Tags

Speech Conversion, Stuttering Correction, End-to-End Models, Waveform-to-Waveform, ASR, TTS, LSTM, Transformer, Deep Learning

Research Topics

Speech Processing, Assistive Technologies, Deep Learning for Audio, Speech Synthesis, Automatic Speech Recognition

Methods & Architectures

End-to-end waveform-to-waveform modeling, Convolutional-bidirectional LSTM encoder-decoder with attention, Dual-stream Transformer with shared acoustic-linguistic representations, StutterZero (CNN-BiLSTM-Attention), StutterFormer (Transformer)

Applications & Tasks

Speech Therapy, Assistive Communication, Human-Computer Interaction, Accurate transcription of disfluent speech, Correction of stuttered speech, Improving ASR/TTS pipelines for disfluent speakers, Speech transcription, Speech conversion (stuttered to fluent), End-to-end speech processing

Datasets & Benchmarks

Datasets

SEP-28K, LibriStutter, FluencyBank

Transcription accuracy, Speech fluency, Audio quality

Related Fields

Speech Recognition, Speech Synthesis, Digital Signal Processing, Machine Learning, Assistive Technology

Keywords

stuttering, speech conversion, end-to-end, waveform-to-waveform, ASR, TTS, LSTM, Transformer, deep learning, speech processing, assistive technology, fluency

Academic Context

#Speech Processing #Assistive Technologies #Deep Learning for Audio #Speech Synthesis #Automatic Speech Recognition

Technology Stack

Frameworks & Libraries

LSTM, Transformer

Commercial Potential

Potential Products

Real-time stuttering correction software, ASR systems for disfluent speech, Speech therapy tools

Target Industries

Healthcare (Speech Therapy), Technology (Assistive Devices), Telecommunications

Use Case Examples

A mobile app that allows users who stutter to speak more fluently in real-time. Virtual assistants that can accurately understand and respond to users who stutter. Tools for speech therapists to analyze and improve patient fluency.

Competitive Edge

First end-to-end waveform-to-waveform models for this task, offering a more integrated approach than existing multi-stage systems.

Market Opportunity

Large potential market given the global prevalence of stuttering.

Revenue Models

Software licensing, subscription services, integration into hardware.

Resource Requirements

Compute Needs

High (for training deep learning models)

Data Requirements

Paired stuttered-fluent speech data.

Deployment Constraints

Real-time processing requires efficient models and sufficient computational power.
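One common way to check whether a model is fast enough for real-time use is the real-time factor (RTF): processing time divided by audio duration, where a value below 1 means the model keeps up with incoming speech. The sketch below is an illustration under stated assumptions; the source reports no RTF figures, and `real_time_factor` and `placeholder_model` are hypothetical names standing in for an actual speech conversion model.

```python
import time
from typing import Callable, Sequence


def real_time_factor(
    process: Callable[[Sequence[float]], object],
    samples: Sequence[float],
    sample_rate: int,
) -> float:
    """Return processing time divided by audio duration (RTF)."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    if len(samples) == 0:
        raise ValueError("samples must not be empty")

    audio_seconds = len(samples) / sample_rate
    start = time.perf_counter()
    process(samples)  # run the model once on the whole clip
    elapsed = time.perf_counter() - start
    return elapsed / audio_seconds


def placeholder_model(samples: Sequence[float]) -> list:
    """Hypothetical stand-in for a model: just copies the input."""
    return list(samples)


# One second of silence at an assumed 16 kHz sample rate.
clip = [0.0] * 16000
rtf = real_time_factor(placeholder_model, clip, 16000)
print(f"RTF: {rtf:.6f} (below 1.0 means faster than real time)")
```

A single timing run is noisy; averaging over several clips of realistic length on the target hardware would give a more useful estimate.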

Scalability

Scalability depends on model architecture and available compute resources.

Regulatory Considerations

None explicitly mentioned, but potential for medical device classification if used in therapy.

Production Readiness

Maturity Level

Research Prototype

Time to Market

1-2 years for a polished product

Patent Potential

Potential for patents on novel architectures and training methods.