Source: arxiv_ml. Type: Research Paper. Audience: audio engineers, AI researchers in audio, software developers for audio applications, musicians and content creators.

AnyEnhance: A Unified Generative Model with Prompt-Guidance and Self-Critic for Voice Enhancement


📄 Abstract

We introduce AnyEnhance, a unified generative model for voice enhancement that processes both speech and singing voices. Built on a masked generative model, AnyEnhance supports a wide range of enhancement tasks, including denoising, dereverberation, declipping, super-resolution, and target speaker extraction, all simultaneously and without fine-tuning. AnyEnhance introduces a prompt-guidance mechanism for in-context learning, which allows the model to natively accept a reference speaker's timbre. This can boost enhancement performance when reference audio is available, and it enables target speaker extraction without altering the underlying architecture. We also introduce a self-critic mechanism into the generative process for masked generative models, yielding higher-quality outputs through iterative self-assessment and refinement. Extensive experiments on various enhancement tasks demonstrate that AnyEnhance outperforms existing methods on both objective metrics and subjective listening tests. Audio demos are publicly available at https://amphionspace.github.io/anyenhance. An open-source implementation is provided at https://github.com/viewfinder-annn/anyenhance-v1-ccf-aatc.
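The self-critic idea in the abstract can be illustrated with a toy decoding loop. This is a minimal sketch, not the authors' implementation: `toy_predict` and `toy_critic` are hypothetical stand-ins for the trained model, the `MASK` id is arbitrary, and the cosine re-masking schedule is an assumption borrowed from common masked-generative decoders. The point is only the control flow: fill the masked positions, let a critic score every token, and re-mask the least plausible ones for the next pass.

```python
import math
import random

MASK = -1  # hypothetical mask token id


def toy_predict(tokens, rng):
    """Stand-in for the generative model: fill each masked slot with a random token."""
    return [rng.randrange(8) if t == MASK else t for t in tokens]


def toy_critic(tokens, rng):
    """Stand-in for the critic: one score in [0, 1) per token, higher = more plausible."""
    return [rng.random() for _ in tokens]


def decode(length, steps, predict=toy_predict, critic=toy_critic, seed=0):
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        # 1. Fill every masked position.
        tokens = predict(tokens, rng)
        # 2. Cosine schedule (an assumption): how many tokens to re-mask.
        n_mask = int(length * math.cos(math.pi / 2 * step / steps))
        if n_mask == 0:
            break
        # 3. The critic scores all tokens, including ones kept earlier.
        scores = critic(tokens, rng)
        worst = sorted(range(length), key=scores.__getitem__)[:n_mask]
        # 4. Re-mask the least plausible tokens for the next pass.
        for i in worst:
            tokens[i] = MASK
    return tokens


print(decode(length=16, steps=4))
```

Because the critic rescoring covers every position, a token accepted in an early pass can still be revised later, which is one plausible reading of "iterative self-assessment and refinement".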

Authors (8)

Junan Zhang

Jing Yang

Zihao Fang

Yuancheng Wang

Zehua Zhang

Zhuo Wang

+2 more

Submitted

January 26, 2025

arXiv Category

cs.SD


Key Contributions

AnyEnhance is a unified generative model for voice enhancement that handles both speech and singing voices across multiple tasks (denoising, dereverberation, declipping, super-resolution, target speaker extraction) without fine-tuning. It introduces a prompt-guidance mechanism that conditions the model on a reference speaker's timbre, and a self-critic mechanism that refines outputs through iterative self-assessment.
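One way to picture how a prompt can be added "without altering the underlying architecture" is as a change to the input layout only. The sketch below is an assumption for illustration: the function name `build_input`, the token-level concatenation, and the `MASK` id are all hypothetical, and the actual model may condition on the reference audio differently. The degraded input that the model also sees is left out for brevity.

```python
MASK = -1  # hypothetical mask token id


def build_input(target_len, prompt_tokens=()):
    """Lay out one sequence: [reference prompt | fully masked target].

    Returns (tokens, is_target). is_target marks the positions the model
    must generate; prompt positions stay fixed and act as context.
    """
    prompt = list(prompt_tokens)
    tokens = prompt + [MASK] * target_len
    is_target = [False] * len(prompt) + [True] * target_len
    return tokens, is_target


# With a reference clip: the target is generated in the context of the prompt.
tokens, is_target = build_input(4, prompt_tokens=[5, 2, 7])

# Without one: same function and layout, just an empty prompt.
plain, _ = build_input(4)
```

Under this layout the same network serves plain enhancement (empty prompt) and target speaker extraction (prompt from the desired speaker), which matches the abstract's claim at a high level.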

Business Value

Enables high-quality, versatile voice enhancement for applications like virtual assistants, content creation, and communication tools, improving user experience.

Paper Metadata

Innovation Type

Algorithmic/Model Architecture

Deployment Feasibility

Moderate to High, depending on computational resources for inference.

Limitations Addressed

Need for separate models for different enhancement tasks, inability to use a reference speaker's audio without architectural changes or fine-tuning, suboptimal output quality in generative audio models

Performance Gains

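Reported to outperform existing methods on both objective metrics and subjective listening tests; the abstract attributes the quality gains to the self-critic mechanism and to prompt guidance when reference audio is available.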

Technical Tags

Voice Enhancement, Generative Models, Masked Generative Model, Prompt Guidance, Self-Critic Mechanism, Speech Processing, Singing Voice, Denoising, Dereverberation, Super-Resolution, Speaker Extraction, In-context Learning, Timbre Transfer

Research Topics

Audio Signal Processing, Generative AI for Audio, Speech Synthesis and Enhancement, Deep Learning for Audio

Methods & Architectures

Masked Generative Model, Prompt-Guidance, Self-Critic Mechanism

Applications & Tasks

Audio Processing, Speech Technology, Music Production, Telecommunications; Improving audio quality, Enhancing speech and singing voices, Extracting specific audio characteristics; Denoising, Dereverberation, Declipping, Super-resolution, Target speaker extraction, Voice enhancement for speech and singing

Related Fields

Digital Signal Processing, Machine Learning, Natural Language Processing (origin of the in-context learning idea; the prompt here is reference audio, not text), Audio Engineering

Keywords

Voice Enhancement, Generative Model, Speech, Singing, Denoising, Dereverberation, Speaker Extraction, Prompt Guidance, Self-Critic, Audio Quality, Timbre Transfer, Masked Generative Model

Academic Context

#Audio Signal Processing #Generative AI for Audio #Speech Synthesis and Enhancement #Deep Learning for Audio

Commercial Potential

Potential Products

Real-time voice enhancement plugins for DAWs, APIs for voice enhancement services, consumer audio enhancement apps

Target Industries

Media and Entertainment, Telecommunications, Software Development, Gaming

Use Case Examples

Cleaning up noisy recordings of speech or singing, separating a target voice from background music or other speakers, improving the clarity of voice calls

Competitive Edge

Offers a single model for multiple voice enhancement tasks, with optional conditioning on a reference speaker's timbre through prompt guidance.

Market Opportunity

Growing market for audio processing and AI-driven content creation tools.

Revenue Models

Licensing of the model or API; integration into software products.

Resource Requirements

Compute Needs

Moderate to high for training, moderate for inference.

Data Requirements

Large datasets of speech and singing audio, potentially with corresponding clean/noisy pairs.
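Clean/degraded pairs of this kind are often produced by corrupting clean audio on the fly. The following is a minimal sketch under stated assumptions, not the paper's data pipeline: the function name `degrade` and all parameter values are hypothetical, the bandwidth reduction is a crude sample-and-hold with no anti-aliasing filter, and reverberation is omitted because it would require convolving with a room impulse response.

```python
import math
import random


def degrade(clean, rng, snr_db=10.0, clip_level=0.5, decimate=2):
    """Return a degraded copy of `clean`, a list of floats in [-1, 1]."""
    # Additive white noise scaled to the requested signal-to-noise ratio.
    power = sum(x * x for x in clean) / len(clean)
    noise_std = math.sqrt(power / 10 ** (snr_db / 10))
    noisy = [x + rng.gauss(0.0, noise_std) for x in clean]
    # Hard clipping (the input to a declipping task).
    clipped = [max(-clip_level, min(clip_level, x)) for x in noisy]
    # Crude resolution loss: hold every `decimate`-th sample.
    return [clipped[i - i % decimate] for i in range(len(clipped))]


rng = random.Random(0)
clean = [0.8 * math.sin(2 * math.pi * i / 50) for i in range(1000)]
degraded = degrade(clean, rng)  # (clean, degraded) forms one training pair
```

Randomizing which degradations are applied, and how strongly, is a common way to cover several enhancement tasks with a single training set.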

Deployment Constraints

Latency requirements for real-time applications.

Scalability

Scalable for processing large volumes of audio data.

Production Readiness

Maturity Level

Research/Development

Time to Market

1-2 years for specialized applications.

Patent Potential

Potential for patents on novel generative mechanisms and prompt-guidance techniques.
