Enhancing Naturalness in LLM-Generated Utterances through Disfluency Insertion

Abstract

Disfluencies are a natural feature of spontaneous human speech but are typically absent from the outputs of Large Language Models (LLMs). This absence can diminish the perceived naturalness of synthesized speech, which is an important criterion when building conversational agents that aim to mimic human behaviours. We show how the insertion of disfluencies can alleviate this shortcoming. The proposed approach involves (1) fine-tuning an LLM with Low-Rank Adaptation (LoRA) to incorporate various types of disfluencies into LLM-generated utterances and (2) synthesizing those utterances using a text-to-speech model that supports the generation of speech phenomena such as disfluencies. We evaluated the quality of the generated speech across two metrics: intelligibility and perceived spontaneity. We demonstrate through a user study that the insertion of disfluencies significantly increases the perceived spontaneity of the generated speech. This increase came, however, along with a slight reduction in intelligibility.
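The abstract names intelligibility as one of the two evaluation metrics but does not say how it was measured. One common proxy, assumed here purely for illustration, is word error rate (WER): transcribe the synthesized audio with a speech recognizer and compare the transcript to the text that was sent to the TTS model. The function name and example sentences below are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Assumption: WER is used here as a stand-in for intelligibility;
    the source does not state which measure the authors used.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# A fluent reference scored against a transcript with one extra filler word.
print(word_error_rate("book a table for two", "book uh a table for two"))  # 0.2
```

Note that with this kind of measure, a filler word in the transcript counts as an error whenever the reference is the fluent text, so the reference should probably be the disfluent text actually passed to the synthesizer.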

Authors (3)

Syed Zohaib Hassan

Pierre Lison

Pål Halvorsen

Submitted

December 17, 2024

arXiv Category

cs.CL

Key Contributions

Enhances the naturalness of LLM-generated utterances for speech synthesis by inserting disfluencies. The approach fine-tunes an LLM with LoRA to incorporate disfluencies and then synthesizes these utterances with a TTS model that supports disfluencies, significantly increasing perceived spontaneity in a user study.
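To make concrete what disfluency insertion looks like at the text level, here is a toy, rule-based stand-in. It is not the authors' method: the paper uses a LoRA-fine-tuned LLM, whereas this sketch randomly inserts filled pauses and word repetitions. The filler inventory, the insertion rate, and the function name are all assumptions chosen for illustration.

```python
import random

# Hypothetical filler inventory; the source does not list the
# disfluency types or tokens the fine-tuned model produces.
FILLED_PAUSES = ("uh", "um")


def insert_disfluencies(utterance: str, rate: float = 0.2, seed: int = 0) -> str:
    """Insert a filled pause or a word repetition before some words.

    A rule-based illustration only, not the LLM-based approach
    described in the paper.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    out = []
    for word in utterance.split():
        if rng.random() < rate:
            if rng.random() < 0.5:
                # Filled pause, e.g. "um, tomorrow"
                out.append(rng.choice(FILLED_PAUSES) + ",")
            else:
                # Repetition, e.g. "the, the"
                out.append(word + ",")
        out.append(word)
    return " ".join(out)


print(insert_disfluencies("I think we should meet tomorrow morning", rate=0.3))
```

Setting `rate=0.0` returns the utterance unchanged, which makes it easy to produce a fluent baseline alongside the disfluent variant for side-by-side listening.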

Business Value

Improves the user experience for voice-based AI systems (e.g., chatbots, virtual assistants) by making their speech sound more human-like and engaging.

Paper Metadata

Innovation Type

Methodological

Deployment Feasibility

Moderate. Requires integration of LLM fine-tuning and a TTS system capable of handling disfluencies.

Limitations Addressed

The unnaturalness and lack of spontaneity of speech synthesized from LLM-generated text, which typically omits the disfluencies found in natural human speech.

Performance Gains

Significant increase in perceived spontaneity in a user study, accompanied by a slight reduction in intelligibility.

Technical Tags

LLM-generated utterances, naturalness, disfluencies, spontaneity, text-to-speech (TTS), Low-Rank Adaptation (LoRA), conversational agents, speech synthesis

Research Topics

Speech Synthesis, Natural Language Generation, Human-Computer Interaction, Conversational AI, Deep Learning

Methods & Architectures

Fine-tuning LLM with LoRA, Disfluency insertion, TTS synthesis with disfluency support, Large Language Models (LLMs), Text-to-Speech (TTS) models

Applications & Tasks

Conversational AI, Virtual Assistants, Speech Synthesis, Human-Computer Interaction

Lack of naturalness in LLM-generated speech, Absence of disfluencies in synthesized speech, Perceived artificiality of AI voices

Generating natural-sounding speech, Improving perceived spontaneity, Enhancing conversational agent realism

Related Fields

Artificial Intelligence, Machine Learning, Natural Language Processing, Speech Technology

Keywords

disfluencies, naturalness, LLM-generated speech, text-to-speech, spontaneity, conversational agents, LoRA, speech synthesis, HCI, NLP

Academic Context

#Speech Synthesis, #Natural Language Generation, #Human-Computer Interaction, #Conversational AI, #Deep Learning

Technology Stack

Frameworks & Libraries

LoRA

Commercial Potential

Potential Products

More natural-sounding virtual assistants, Advanced TTS engines for audiobooks and content creation, Tools for generating realistic dialogue for games and simulations

Target Industries

Technology, Media and Entertainment, Customer Service, Gaming

Use Case Examples

Creating a virtual assistant that sounds more like a human during conversations, Generating audio content (e.g., podcasts, audiobooks) with improved naturalness

Competitive Edge

Offers a novel approach to enhance TTS naturalness by leveraging human speech characteristics (disfluencies) within LLM generation, potentially surpassing current TTS systems in perceived spontaneity.

Market Opportunity

Large and growing market for conversational AI and synthetic media.

Revenue Models

Licensing of TTS technology, API services.

Resource Requirements

Compute Needs

Moderate (for LoRA fine-tuning and TTS inference)
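A rough sense of why LoRA keeps fine-tuning cost moderate: instead of updating a full weight matrix W of shape d_out x d_in, LoRA trains two low-rank factors B (d_out x r) and A (r x d_in) and adds their product to the frozen W. The sketch below only counts trainable parameters for one such matrix. The layer dimensions and rank are illustrative assumptions; the source does not state the base model or the rank used.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in


def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for the LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)


# Assumed example: one 4096 x 4096 projection matrix adapted with rank 8.
d_out, d_in, rank = 4096, 4096, 8
full = full_finetune_params(d_out, d_in)
lora = lora_trainable_params(d_out, d_in, rank)

print(full)         # 16777216
print(lora)         # 65536
print(lora / full)  # 0.00390625
```

Under these assumed sizes, the adapter trains about 0.4% of the parameters of the matrix it modifies; actual savings depend on which layers are adapted and on the chosen rank.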

Data Requirements

Requires LLM fine-tuning data and TTS models capable of handling disfluencies.

Deployment Constraints

Latency in TTS generation, integration complexity.

Scalability

Scalability depends on the efficiency of the LoRA fine-tuning and the TTS model.

Production Readiness

Maturity Level

Research

Time to Market

1-2 years

Patent Potential

Moderate (novel application of disfluencies in LLM-TTS integration)
