
UniSE: A Unified Framework for Decoder-only Autoregressive LM-based Speech Enhancement


Abstract

The development of neural audio codecs (NACs) has greatly advanced the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder-only LM-based framework to handle different SE tasks including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which makes the distinct learning patterns of the multiple tasks compatible within one model. Experiments on several benchmarks indicate the proposed UniSE can achieve competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.

Authors (5)

Haoyin Yan

Chengwei Liu

Shaofei Xue

Xiaotao Liang

Zheng Xue

Submitted

October 23, 2025

arXiv Category

cs.SD


Key Contributions

UniSE proposes a unified framework for various speech enhancement tasks using a decoder-only autoregressive language model. It demonstrates that LMs can effectively handle diverse SE sub-tasks by generating discrete tokens conditioned on input speech features, showing competitive performance against specialized baselines.
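The conditioning-then-generation idea described above can be illustrated with a toy decoding loop. This is only a minimal sketch under stated assumptions: the scoring function, the vocabulary size, the end-of-sequence id and the feature values are all hypothetical stand-ins, not the UniSE model or its codec. A real system would replace toy_next_token_logits with a decoder-only LM forward pass over the condition followed by the tokens generated so far.

```python
import math
import random

# All names, sizes, and the toy scoring rule are illustrative assumptions;
# none of them come from the UniSE paper or its repository.
CODEBOOK_SIZE = 8      # hypothetical codec vocabulary size
EOS = CODEBOOK_SIZE    # extra id that marks the end of generation


def toy_next_token_logits(condition, generated):
    """Stand-in for a decoder-only LM forward pass.

    A real model would attend over the full prefix (condition features
    followed by generated tokens). This toy simply favours a token derived
    from the condition frame aligned with the current step.
    """
    step = len(generated)
    logits = [0.0] * (CODEBOOK_SIZE + 1)
    if step >= len(condition):
        logits[EOS] = 5.0  # stop once every condition frame is covered
    else:
        target = int(sum(condition[step])) % CODEBOOK_SIZE
        logits[target] = 5.0
    return logits


def sample(logits, temperature, rng):
    """Pick an index from softmax(logits / temperature); greedy if temperature is 0."""
    if temperature == 0:
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    weights = [math.exp(value - peak) for value in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(condition, max_len=64, temperature=0.0, seed=0):
    """Autoregressively generate target-speech tokens given condition features."""
    rng = random.Random(seed)
    generated = []
    for _ in range(max_len):
        logits = toy_next_token_logits(condition, generated)
        token = sample(logits, temperature, rng)
        if token == EOS:
            break
        generated.append(token)
    return generated


# Three hypothetical feature frames extracted from degraded input speech.
features = [[1.0, 2.0], [3.0, 4.0], [7.5, 8.5]]
tokens = generate(features)
print(tokens)  # [3, 7, 0]
```

In a full pipeline, the generated token ids would then be passed to a neural audio codec decoder to reconstruct the enhanced waveform; that step is omitted here.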

Business Value

Enables more versatile and potentially higher-quality audio processing solutions for applications like voice assistants, call center noise reduction, and audio editing, by using a single model for multiple tasks.

Paper Metadata

Innovation Type

Framework Design

Deployment Feasibility

Moderate. LM inference can be computationally intensive, but advancements in efficient LMs and hardware are improving feasibility.

Limitations Addressed

Addresses the lack of verification for autoregressive LM-based models in unifying different speech enhancement sub-tasks, showing their effectiveness across restoration, extraction, and separation.


Technical Tags

Speech Enhancement, Autoregressive LM, Decoder-only, Unified Framework, Neural Audio Codecs, Speech Restoration, Speaker Extraction, Speech Separation, Discrete Tokens, Conditional Generation

Research Topics

Speech Processing, Natural Language Processing, Generative Models, Audio Signal Processing, Unified AI Frameworks

Methods & Architectures

Decoder-only Autoregressive Language Model, Discrete Token Generation, Conditional Generation, Unified Framework for SE Tasks

Applications & Tasks

Telecommunications, Audio Processing, Media Production, Assistive Technologies, Signal Enhancement, Task Unification, Generative Modeling, Speech Restoration, Target Speaker Extraction, Speech Separation

Related Fields

Speech Processing, Natural Language Processing, Generative AI, Audio Engineering, Machine Learning

Keywords

Speech Enhancement, Language Models, Autoregressive Models, Decoder-only, Unified Framework, Speech Restoration, Speaker Extraction, Speech Separation, Neural Audio Codecs, Discrete Tokens, Conditional Generation, Audio Processing

Academic Context

#Speech Processing, #Natural Language Processing, #Generative Models, #Audio Signal Processing, #Unified AI Frameworks

Technology Stack

Frameworks & Libraries

PyTorch

Programming Languages

Python

Commercial Potential

Potential Products

Universal audio enhancement software, advanced noise cancellation plugins, speech enhancement APIs

Target Industries

Telecommunications, Media and Entertainment, Software Development, Consumer Electronics

Use Case Examples

Improving call quality in noisy environments, separating individual voices from a group conversation, restoring degraded audio recordings

Competitive Edge

Offers a unified approach to multiple SE tasks, potentially simplifying development and improving performance compared to using separate models for each task.

Market Opportunity

Significant market for audio processing and enhancement tools.

Revenue Models

Licensing of the framework/models, API services.

Resource Requirements

Compute Needs

High (for training and inference of large LMs)

Data Requirements

Large datasets of clean and noisy speech pairs for various SE tasks.

Deployment Constraints

Inference latency and computational cost of large LMs.
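The latency concern comes from autoregressive decoding being sequential: each output token needs its own forward pass. The back-of-the-envelope sketch below shows how the number of steps and a rough real-time factor could be estimated. Every number in it (token rate, per-step time, clip length) is a hypothetical placeholder, not a measurement of UniSE or of any specific codec.

```python
def ar_decoding_steps(duration_s, frames_per_second, tokens_per_frame=1):
    """Sequential forward passes needed to decode one utterance."""
    return round(duration_s * frames_per_second) * tokens_per_frame


def real_time_factor(steps, seconds_per_step, duration_s):
    """Processing time divided by audio duration; below 1.0 is faster than real time."""
    return steps * seconds_per_step / duration_s


# Hypothetical values chosen only to illustrate the arithmetic.
clip_seconds = 10          # assumed clip length
codec_frame_rate = 50      # assumed codec frames (tokens) per second
step_latency = 0.01        # assumed seconds per forward pass

steps = ar_decoding_steps(clip_seconds, codec_frame_rate)
rtf = real_time_factor(steps, step_latency, clip_seconds)
print(steps, rtf)
```

Under these assumptions the step count grows linearly with clip length and codec token rate, which is why a lower token rate or a faster per-step forward pass directly reduces latency.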

Scalability

Scalability depends on the underlying LM architecture and available compute resources. Model compression techniques could improve scalability.

Regulatory Considerations

None explicitly mentioned, but audio quality standards may apply.

Production Readiness

Maturity Level

Research/Development

Time to Market

1-2 years

Licensing

Open Source (based on GitHub availability)

Patent Potential

Moderate (novel framework design)
