
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model


📄 Abstract

With the growing requirement for natural human-computer interaction, speech-based systems are receiving increasing attention, as speech is one of the most common forms of daily communication. However, existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal large language model capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

Authors (14)

Zuwei Long

Yunhang Shen

Chaoyou Fu

Heting Gao

Lijiang Li

Peixian Chen

+8 more

Submitted

May 6, 2025

arXiv Category

cs.CL


Key Contributions

Introduces VITA-Audio, an end-to-end large speech model designed for fast audio-text token generation in streaming scenarios. It features a lightweight Multiple Cross-modal Token Prediction (MCTP) module to generate multiple audio tokens per forward pass, significantly reducing latency for the first audio token.
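The idea of emitting several tokens per forward pass can be illustrated with a toy decoding loop. This is a minimal sketch and not the paper's implementation: `generate`, `toy_backbone`, and `toy_head` are hypothetical names, the "hidden state" is a plain integer, and the number of extra heads (4) is an arbitrary choice for illustration. It only shows that cheap extra prediction heads reduce the number of expensive backbone passes needed for the same output.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-ins: the backbone maps the token history to
# (next_token, hidden_state); a head maps a hidden state to
# (next_token, new_hidden_state). Real models use tensors, not ints.
Backbone = Callable[[List[int]], Tuple[int, int]]
Head = Callable[[int], Tuple[int, int]]


def generate(backbone: Backbone, heads: List[Head],
             prompt: List[int], max_new_tokens: int) -> Tuple[List[int], int]:
    """Return (new_tokens, number_of_backbone_passes)."""
    tokens = list(prompt)
    passes = 0
    produced = 0
    while produced < max_new_tokens:
        # One expensive backbone pass yields one token and a hidden state.
        token, hidden = backbone(tokens)
        passes += 1
        tokens.append(token)
        produced += 1
        # Each lightweight head yields one more token from the hidden
        # state, without running the backbone again.
        for head in heads:
            if produced >= max_new_tokens:
                break
            token, hidden = head(hidden)
            tokens.append(token)
            produced += 1
    return tokens[len(prompt):], passes


def toy_backbone(tokens: List[int]) -> Tuple[int, int]:
    nxt = tokens[-1] + 1  # toy rule: count upward
    return nxt, nxt


def toy_head(hidden: int) -> Tuple[int, int]:
    nxt = hidden + 1  # same toy rule, driven by the hidden state
    return nxt, nxt


baseline_tokens, baseline_passes = generate(toy_backbone, [], [0], 10)
fast_tokens, fast_passes = generate(toy_backbone, [toy_head] * 4, [0], 10)
print(baseline_passes, fast_passes)
```

In this toy setup both runs produce the same ten tokens, but the run with four extra heads needs far fewer backbone passes. Whether real prediction heads match the backbone's output quality is exactly what a training strategy has to ensure; the sketch assumes it trivially.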

Business Value

Enables more natural and responsive human-computer interactions through significantly faster and lower-latency speech generation, improving user experience in voice assistants and real-time communication tools.

Paper Metadata

Innovation Type

Algorithmic/Architecture

Deployment Feasibility

High: designed for efficiency and low latency, and suitable for real-time applications.

Limitations Addressed

High latency in generating the first audio token during streaming, which is a significant bottleneck for real-time speech applications.

Performance Gains

Reports a 3~5x inference speedup at the 7B parameter scale and significantly lower latency for the first audio token in streaming scenarios.
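As a rough way to reason about where a multi-token speedup can come from, the following sketch uses a simple cost model. It is an assumption for illustration, not a formula from the paper: `estimated_speedup`, `extra_tokens`, and `head_cost_ratio` are hypothetical names, and the model ignores everything except the relative cost of a backbone pass versus a lightweight prediction step.

```python
def estimated_speedup(extra_tokens: int, head_cost_ratio: float) -> float:
    """Speedup over one-token-per-pass decoding under a toy cost model.

    Assumptions (not from the paper):
      - one backbone pass costs 1 unit and yields 1 token;
      - each extra token costs `head_cost_ratio` units (0 <= ratio <= 1).
    """
    if extra_tokens < 0:
        raise ValueError("extra_tokens must be non-negative")
    if not 0.0 <= head_cost_ratio <= 1.0:
        raise ValueError("head_cost_ratio must be in [0, 1]")
    tokens_per_step = 1 + extra_tokens
    cost_per_step = 1.0 + extra_tokens * head_cost_ratio
    # Baseline spends 1 unit per token, so speedup is tokens per unit cost.
    return tokens_per_step / cost_per_step


for ratio in (0.0, 0.1, 0.25):
    print(ratio, round(estimated_speedup(4, ratio), 2))
```

Under this model the speedup approaches the number of tokens per step only when the extra predictions are nearly free, which is why a lightweight prediction module matters.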

Technical Tags

speech synthesis, large speech models, low latency, streaming audio, cross-modal token generation, multiple cross-modal token prediction, progressive training

Research Topics

Speech Synthesis, Natural Language Processing, Human-Computer Interaction, Deep Learning, Real-time Systems

Methods & Architectures

End-to-end Speech Model, Multiple Cross-modal Token Prediction (MCTP), Progressive Training Strategy, Large Speech Models, End-to-end Models

Applications & Tasks

Human-Computer Interaction, Voice Assistants, Real-time Communication, Accessibility, High Latency in Speech Generation, Streaming Audio Bottlenecks, Model Efficiency, Speech Quality, Fast Audio Token Generation, Low-Latency Speech Synthesis, Real-time Voice Interaction

Related Fields

Speech Processing, Natural Language Processing, Machine Learning, Real-time Systems, Human-Computer Interaction

Keywords

speech synthesis, large language models, low latency, streaming audio, cross-modal generation, voice assistants, real-time interaction, audio generation, text-to-speech, MCTP

Academic Context

#Speech Synthesis, #Natural Language Processing, #Human-Computer Interaction, #Deep Learning, #Real-time Systems

Commercial Potential

Potential Products

Next-generation voice assistants with near-instantaneous responses; real-time translation and communication tools; improved accessibility features for audio content

Target Industries

Technology, Telecommunications, Customer Service, Media and Entertainment

Use Case Examples

Enabling seamless voice commands for smart devices; providing real-time voice feedback in gaming; improving the responsiveness of automated customer service agents

Competitive Edge

Offers a significant improvement in latency for speech generation compared to existing large speech models, crucial for real-time applications.

Market Opportunity

Large and growing market for voice-enabled technologies and AI assistants.

Revenue Models

Licensing of the VITA-Audio model; integration into voice-enabled products and services.

Resource Requirements

Compute Needs

Moderate to High, depending on model size.

Data Requirements

Requires large-scale speech and text datasets.

Deployment Constraints

Real-time processing capabilities, network bandwidth for streaming.

Scalability

Scalability is addressed through efficient architecture and training strategies.

Production Readiness

Maturity Level

Research

Time to Market

1-3 years

Patent Potential

Moderate: a novel architecture for efficient token generation.
