
SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution

Abstract

Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate "lang" or "voice" spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.
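The script-first segmentation step mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation. It assumes that the first word of a character's Unicode name (for example "LATIN" or "DEVANAGARI") is a good enough stand-in for its script, since Python's standard library does not expose the Unicode Script property directly. The helper names `script_of` and `segment_by_script` are hypothetical.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Approximate a character's script from the first word of its Unicode name.

    Assumption: this heuristic replaces a real Unicode Script lookup.
    Non-letters (spaces, digits, punctuation, combining marks) are "Common".
    """
    if not ch.isalpha():
        return "Common"
    try:
        return unicodedata.name(ch).split()[0].title()
    except ValueError:  # character has no name in the Unicode database
        return "Unknown"


def segment_by_script(text: str) -> list[tuple[str, str]]:
    """Split text into maximal runs of one script.

    "Common" characters are attached to the run they follow, or to the
    next run when they appear at the very start of the text.
    """
    runs: list[list[str]] = []  # each entry is [script, run_text]
    for ch in text:
        script = script_of(ch)
        if runs and (script == "Common" or script == runs[-1][0]):
            runs[-1][1] += ch
        elif runs and runs[-1][0] == "Common":
            # A leading neutral run adopts the script of the first letter.
            runs[-1][0] = script
            runs[-1][1] += ch
        else:
            runs.append([script, ch])
    return [(script, run) for script, run in runs]


if __name__ == "__main__":
    for script, run in segment_by_script("Hello नमस्ते world"):
        print(script, repr(run))
```

Attaching neutral characters to the preceding run keeps spaces and punctuation inside a segment rather than producing tiny standalone spans. A production system would likely use a proper Script property table and handle scripts shared by several languages, which is where the language identification step comes in.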
Authors: Dharma Teja Donepudi
Submitted: October 27, 2025
arXiv Category: cs.SD

Key Contributions

Introduces SFMS-ALR, an engine-agnostic framework for fluent, real-time code-switched speech synthesis. It segments text by Unicode script, adaptively identifies each segment's language and locale, and normalizes prosody with sentiment-aware adjustments. Because it requires no retraining, it works with existing TTS voices from Google, Apple, Amazon, and other providers.
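To make the locale resolution and SSML generation steps concrete, here is a small Python sketch under stated assumptions. The paper describes adaptive language identification; the sketch substitutes a plain script-to-locale lookup with an optional user preference override, so it shows only the shape of the output. The table `DEFAULT_LOCALES` and the functions `resolve_locale` and `to_ssml` are hypothetical names, and the locale values are illustrative. The `<lang xml:lang="...">` element is part of SSML 1.1, though support varies by TTS engine.

```python
from xml.sax.saxutils import escape, quoteattr

# Assumption: illustrative defaults only. A script such as Latin is shared
# by many languages, so a real system needs proper language identification.
DEFAULT_LOCALES = {
    "Latin": "en-US",
    "Devanagari": "hi-IN",
    "Cyrillic": "ru-RU",
}


def resolve_locale(script, preferred=None, fallback="en-US"):
    """Pick a locale for a script, letting caller preferences win."""
    if preferred and script in preferred:
        return preferred[script]
    return DEFAULT_LOCALES.get(script, fallback)


def to_ssml(segments, preferred=None):
    """Wrap (script, text) segments in <lang> spans inside one <speak> root.

    The result is a single document, so it can be sent in one TTS request.
    """
    parts = []
    for script, text in segments:
        locale = resolve_locale(script, preferred)
        # escape() protects &, <, > in the text; quoteattr() quotes the locale.
        parts.append(f"<lang xml:lang={quoteattr(locale)}>{escape(text)}</lang>")
    return "<speak>" + "".join(parts) + "</speak>"


if __name__ == "__main__":
    segments = [("Latin", "Hi & "), ("Devanagari", "नमस्ते")]
    print(to_ssml(segments))
    # Override the Latin default, e.g. for a Spanish-speaking user.
    print(to_ssml(segments, preferred={"Latin": "es-ES"}))
```

Some engines expect extra attributes on `<speak>` (such as a version or a default language) or prefer `<voice>` spans over `<lang>`, so the wrapper would need adjusting per provider.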

Business Value

Enables more natural and engaging voice interactions for global audiences, improving customer experiences in multilingual support, content localization, and virtual assistants.