SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Abstract

Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate "lang" or "voice" spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.
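The script-first segmentation step described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names `script_of` and `segment_by_script`, the prefix table, and the rule that spaces, digits, and combining marks stay with the current run are all assumptions made for illustration. Script detection here is a rough heuristic based on Unicode character names from the standard `unicodedata` module.

```python
import unicodedata

# Assumed prefix -> script label table (illustrative, far from exhaustive).
SCRIPT_PREFIXES = (
    ("LATIN", "Latin"), ("DEVANAGARI", "Devanagari"), ("TELUGU", "Telugu"),
    ("CJK", "Han"), ("HIRAGANA", "Kana"), ("KATAKANA", "Kana"),
    ("HANGUL", "Hangul"), ("ARABIC", "Arabic"), ("CYRILLIC", "Cyrillic"),
)

def script_of(ch):
    """Return a rough script label, or None for script-neutral characters."""
    if not ch.isalpha():
        return None  # spaces, digits, punctuation, combining marks
    name = unicodedata.name(ch, "")
    for prefix, label in SCRIPT_PREFIXES:
        if name.startswith(prefix):
            return label
    return "Other"

def segment_by_script(text):
    """Split text into (script, substring) runs; neutral characters join the current run."""
    segments = []  # each entry is [script, text]
    for ch in text:
        s = script_of(ch)
        if segments and (s is None or s == segments[-1][0]):
            segments[-1][1] += ch
        elif segments and segments[-1][0] is None:
            # A leading neutral run adopts the first real script that follows it.
            segments[-1][0] = s
            segments[-1][1] += ch
        else:
            segments.append([s, ch])
    return [(s, t) for s, t in segments]

if __name__ == "__main__":
    for script, chunk in segment_by_script("Hello नमस्ते world"):
        print(script, repr(chunk))
```

Script alone does not determine language (Latin script covers English, Spanish, and many others), which is why the framework follows segmentation with a language and locale identification step.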

Authors (1)

Dharma Teja Donepudi

Submitted

October 27, 2025

arXiv Category

cs.SD

Key Contributions

Introduces SFMS-ALR, an engine-agnostic framework for fluent, real-time code-switched speech synthesis that segments text by Unicode script, adaptively identifies each segment's language and locale, and normalizes prosody with sentiment-aware adjustments. The approach requires no retraining and works with existing TTS voices from providers such as Google, Apple, and Amazon.
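As a rough illustration of the final step, the sketch below assembles a single SSML document from text that has already been split into (script, text) pairs. It is a sketch under stated assumptions, not the paper's algorithm: `SCRIPT_DEFAULT_LOCALE`, `resolve_locale`, and `build_ssml` are hypothetical names, the script-to-locale table is invented for the example, and the `rate` and `pitch` arguments stand in for whatever values a sentiment-aware prosody step would choose. The `<lang>` and `<prosody>` elements come from SSML 1.1, though individual engines differ in which elements they support.

```python
from xml.sax.saxutils import escape

# Hypothetical script -> default locale table (not taken from the paper).
SCRIPT_DEFAULT_LOCALE = {
    "Devanagari": "hi-IN",
    "Telugu": "te-IN",
    "Hangul": "ko-KR",
    "Cyrillic": "ru-RU",
}

def resolve_locale(script, base_locale="en-US"):
    """Pick a locale for a script, falling back to the utterance's base locale."""
    return SCRIPT_DEFAULT_LOCALE.get(script, base_locale)

def build_ssml(segments, base_locale="en-US", rate="medium", pitch="medium"):
    """Wrap foreign-locale segments in <lang> spans under one shared <prosody>."""
    parts = []
    for script, text in segments:
        locale = resolve_locale(script, base_locale)
        body = escape(text)  # protect &, <, > in the input text
        if locale != base_locale:
            body = f'<lang xml:lang="{locale}">{body}</lang>'
        parts.append(body)
    inner = "".join(parts)
    return (
        f'<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{base_locale}">'
        f'<prosody rate="{rate}" pitch="{pitch}">{inner}</prosody></speak>'
    )

if __name__ == "__main__":
    print(build_ssml([("Latin", "Hi & "), ("Devanagari", "नमस्ते")]))
```

Keeping every segment under one `<prosody>` wrapper is one simple way to hold rate and pitch constant across language switches. An engine that lacks `<lang>` support could emit `<voice>` spans instead, as the abstract notes.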

Business Value

Enables more natural and engaging voice interactions for global audiences, improving customer experiences in multilingual support, content localization, and virtual assistants.

Paper Metadata

Innovation Type

Framework and Algorithm

Deployment Feasibility

Highly feasible as it's an engine-agnostic framework designed for seamless integration with existing TTS systems.

Limitations Addressed

Abrupt language shifts in code-switching TTS, Varied scripts and mismatched prosody, Need for retraining conventional monolingual TTS systems

Technical Tags

Multilingual Speech Synthesis, Code-Switching TTS, Script Segmentation, Adaptive Language Identification, Prosody Normalization, Sentiment-Aware Adjustments, SSML, Engine-Agnostic Framework

Research Topics

Speech Synthesis, Natural Language Processing, Multilingual AI, Computational Linguistics, Human-Computer Interaction

Methods & Architectures

Script Segmentation, Adaptive Language Identification, Prosody Normalization, Sentiment Analysis, SSML Generation, Custom TTS Engine Integration

Applications & Tasks

Global Communication, Customer Service, Content Creation, Accessibility, Code-Switching Speech Synthesis, Naturalness in Multilingual TTS, Prosody Mismatch, Real-time Code-Switched Speech Generation, Multilingual Text-to-Speech

Related Fields

Speech Processing, Natural Language Processing, Computational Linguistics, Machine Translation

Keywords

Speech Synthesis, Text-to-Speech, Multilingual, Code-Switching, TTS, Prosody, Language Identification, SSML, Real-time, Framework, Sentiment, Unicode, Engine-Agnostic

Academic Context

Speech Synthesis, Natural Language Processing, Multilingual AI, Computational Linguistics, Human-Computer Interaction

Technology Stack

Programming Languages

Python

Commercial Potential

Potential Products

Multilingual Voice Assistants, Real-time Translation Voice Services, Localized Audio Content Generation Tools

Target Industries

Technology, Media, Telecommunications, Customer Service

Use Case Examples

A customer service chatbot that can seamlessly switch between English and Spanish. Generating audiobooks in multiple languages with natural transitions.

Competitive Edge

Offers a unique engine-agnostic solution that enhances existing TTS systems for code-switching without requiring retraining, differentiating it from end-to-end multilingual models.

Market Opportunity

Significant market for global voice services and multilingual AI applications.

Revenue Models

Licensing the framework to TTS providers; integration into voice platforms.

Resource Requirements

Compute Needs

Minimal compute requirements for the framework itself, as it relies on existing TTS engines.

Data Requirements

Requires access to diverse multilingual speech datasets for evaluating prosody and language characteristics.

Deployment Constraints

Dependency on the quality and capabilities of the underlying TTS engine.

Scalability

Scalable as it integrates with existing TTS systems, leveraging their scalability.

Production Readiness

Maturity Level

Research/Development

Time to Market

1-3 years, for integration and productization.

Patent Potential

Low to moderate, potentially related to the adaptive locale resolution or prosody normalization algorithms.
