
R2-SVC: Towards Real-World Robust and Expressive Zero-shot Singing Voice Conversion

Abstract

In real-world singing voice conversion (SVC) applications, environmental noise and the demand for expressive output pose significant challenges. Conventional methods, however, are typically designed without accounting for real deployment scenarios, as both training and inference usually rely on clean data. This mismatch hinders practical use, given the inevitable presence of diverse noise sources and artifacts from music separation. To tackle these issues, we propose R2-SVC, a robust and expressive SVC framework. First, we introduce simulation-based robustness enhancement through random fundamental frequency ($F_0$) perturbations and music separation artifact simulations (e.g., reverberation, echo), substantially improving performance under noisy conditions. Second, we enrich speaker representation using domain-specific singing data: alongside clean vocals, we incorporate DNSMOS-filtered separated vocals and public singing corpora, enabling the model to preserve speaker timbre while capturing singing style nuances. Third, we integrate the Neural Source-Filter (NSF) model to explicitly represent harmonic and noise components, enhancing the naturalness and controllability of converted singing. R2-SVC achieves state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.
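The abstract's third point, representing harmonic and noise components explicitly, can be illustrated with a sine-plus-noise excitation signal of the kind used as the source in NSF-style vocoders. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function name `sine_noise_excitation`, the per-sample F0 input, and the amplitude values are all hypothetical choices made for illustration.

```python
import math
import random


def sine_noise_excitation(f0, sample_rate=16000, sine_amp=0.1,
                          noise_std=0.003, seed=0):
    """Build a source excitation from a per-sample F0 contour.

    f0 holds one value in Hz per output sample; 0 marks an unvoiced sample.
    All default amplitudes are illustrative assumptions.
    """
    rng = random.Random(seed)
    phase = 0.0
    out = []
    for f in f0:
        if f > 0:
            # Voiced: integrate the instantaneous frequency into a running
            # phase (harmonic component) and add a small noise term.
            phase = (phase + 2.0 * math.pi * f / sample_rate) % (2.0 * math.pi)
            out.append(sine_amp * math.sin(phase) + rng.gauss(0.0, noise_std))
        else:
            # Unvoiced: noise only (noise component).
            out.append(rng.gauss(0.0, sine_amp / 3.0))
    return out


# 10 ms of a 220 Hz voiced segment followed by 10 ms of unvoiced noise.
excitation = sine_noise_excitation([220.0] * 160 + [0.0] * 160)
```

In a full NSF model this excitation would be passed through learned filter modules; the sketch covers only the source side, where the F0 contour directly controls the pitch of the harmonic component.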
Authors: Junjie Zheng, Gongyu Chen, Chaofan Ding, Zihao Chen
Submitted: October 23, 2025
arXiv Category: cs.SD

Key Contributions

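R2-SVC is a framework for robust and expressive zero-shot singing voice conversion (SVC) aimed at real-world deployment. It makes three contributions: simulation-based robustness enhancement (random F0 perturbations and simulated music separation artifacts such as reverberation and echo); richer speaker representation learned from domain-specific singing data (clean vocals, DNSMOS-filtered separated vocals, and public singing corpora); and a Neural Source-Filter (NSF) model that explicitly represents harmonic and noise components. The authors report state-of-the-art results on multiple SVC benchmarks under both clean and noisy conditions.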
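The simulation-based robustness idea can be sketched as two small training-time augmentations: a random perturbation of the F0 contour and a simple echo added to the vocal signal. This is only an illustration under assumptions; the paper's actual perturbation ranges and artifact simulators are not given here, and the helper names `perturb_f0` and `add_echo`, along with every default parameter, are hypothetical.

```python
import random


def perturb_f0(f0, max_semitones=1.0, jitter_cents=20.0, rng=None):
    """Randomly shift an F0 contour (Hz); zeros (unvoiced) stay zero.

    One global shift in semitones is drawn per call, plus a small
    per-frame jitter in cents. Both ranges are illustrative assumptions.
    """
    if rng is None:
        rng = random.Random()
    shift = rng.uniform(-max_semitones, max_semitones)
    out = []
    for f in f0:
        if f <= 0:
            out.append(0.0)
            continue
        cents = 100.0 * shift + rng.uniform(-jitter_cents, jitter_cents)
        out.append(f * 2.0 ** (cents / 1200.0))
    return out


def add_echo(signal, delay, decay=0.4, repeats=3):
    """Add `repeats` delayed copies of the signal, each scaled by decay**k.

    delay is in samples; the output is extended to hold the echo tail.
    """
    out = list(signal) + [0.0] * (delay * repeats)
    for k in range(1, repeats + 1):
        gain = decay ** k
        for n, x in enumerate(signal):
            out[n + k * delay] += gain * x
    return out


noisy_f0 = perturb_f0([0.0, 220.0, 440.0], rng=random.Random(1))
echoed = add_echo([1.0, 0.0, 0.0, 0.0], delay=2, decay=0.5, repeats=2)
```

Applying such perturbations only to the model's inputs while keeping the clean recording as the training target is one common way to teach a model to ignore the simulated degradation; whether R2-SVC follows exactly this setup is not stated in the summary above.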

Business Value

Enables more realistic and versatile singing voice conversion tools for music production, gaming, and virtual entertainment, potentially lowering production costs and opening new creative possibilities.