LeVo: High-Quality Song Generation with Multi-Preference Alignment

Abstract

Recent advances in large language models (LLMs) and audio language models have significantly improved music generation, particularly in lyrics-to-song generation. However, existing approaches still struggle with the complex composition of songs and the scarcity of high-quality data, leading to limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. To address these challenges, we introduce LeVo, a language-model-based framework consisting of LeLM and Music Codec. LeLM is capable of parallel modeling of two types of tokens: mixed tokens, which represent the combined audio of vocals and accompaniment to achieve better vocal-instrument harmony, and dual-track tokens, which separately encode vocals and accompaniment for high-quality song generation. It employs two decoder-only transformers and a modular extension training strategy to prevent interference between different token types. To further enhance musicality and instruction-following ability, we introduce a multi-preference alignment method based on Direct Preference Optimization (DPO). This method handles diverse human preferences through a semi-automatic data construction process and post-training. Experimental results demonstrate that LeVo significantly outperforms existing open-source methods in both objective and subjective metrics, while performing competitively with industry systems. Ablation studies further justify the effectiveness of our designs. Audio examples and source code are available at https://levo-demo.github.io and https://github.com/tencent-ailab/songgeneration.
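The abstract names Direct Preference Optimization but does not spell out the objective. As a reference point, the generic DPO loss for one preference pair can be sketched in plain Python as below. This is the textbook formulation, not LeVo's released code: the function names, the `beta` value, and the idea of averaging over pairs pooled from several preference dimensions are assumptions made for illustration.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Generic DPO loss for a single (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trainable policy and
    under the frozen reference model. beta=0.1 is an assumed default.
    """
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    # Loss is -log(sigmoid(beta * margin)), i.e. softplus(-beta * margin),
    # computed in a numerically stable form.
    x = -beta * (chosen_gain - rejected_gain)
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean_dpo_loss(pairs, beta=0.1):
    """Average the loss over pairs pooled from several preference
    dimensions (hypothetical pooling, not confirmed by the source)."""
    if not pairs:
        raise ValueError("need at least one preference pair")
    return sum(dpo_loss(*p, beta=beta) for p in pairs) / len(pairs)
```

When the policy and the reference model agree on both samples, the loss equals log 2 (about 0.693); it falls as the policy assigns relatively more probability to the preferred sample.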

Authors (13)

Shun Lei, Yaoxun Xu, Zhiwei Lin, Huaicheng Zhang, Wei Tan, Hangting Chen, +7 more

Submitted

June 9, 2025

arXiv Category

cs.SD

Key Contributions

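Introduces LeVo, a language-model-based framework for high-quality song generation that targets limitations in audio quality, musicality, instruction following, and vocal-instrument harmony. It pairs LeLM with a Music Codec. LeLM models two token types in parallel: mixed tokens, which improve vocal-instrument harmony, and dual-track tokens, which encode vocals and accompaniment separately for higher audio quality. Modular extension training keeps the two token types from interfering with each other, and DPO-based multi-preference alignment improves musicality and instruction following.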
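One common way to keep a newly added module from disturbing an already trained one is to freeze the earlier module while the new one trains. The toy sketch below illustrates that idea only; it assumes that modular extension training works by freezing, which the summary does not confirm, and the module names, weights, and gradients are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """Toy stand-in for a transformer: a name, weights, and a freeze flag."""
    name: str
    weights: list
    trainable: bool = True


def sgd_step(modules, grads, lr=0.5):
    """Apply one SGD update, skipping any module that is frozen."""
    for m in modules:
        if not m.trainable:
            continue
        m.weights = [w - lr * g for w, g in zip(m.weights, grads[m.name])]


# Hypothetical two-module setup mirroring the two decoder-only transformers.
mixed_lm = Module("mixed_token_lm", [1.0, 2.0])
dual_decoder = Module("dual_track_decoder", [1.0, 1.0])

# Assumed extension stage: freeze the mixed-token model, train only the
# dual-track module so its gradients cannot alter the first model.
mixed_lm.trainable = False
grads = {"mixed_token_lm": [1.0, 1.0], "dual_track_decoder": [1.0, -1.0]}
sgd_step([mixed_lm, dual_decoder], grads)
```

After the step, the frozen module's weights are unchanged while the dual-track module's weights have moved, which is the non-interference property the strategy is meant to provide.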

Business Value

Democratizes music creation by providing tools for generating high-quality songs, potentially lowering production costs and enabling new forms of artistic expression.

Paper Metadata

Innovation Type

Algorithmic Framework

Deployment Feasibility

Moderate. Training and inference for large audio models require significant computational resources.

Limitations Addressed

Targets the difficulty of modeling complex song composition and the scarcity of high-quality data, which limit audio quality, musicality, instruction following, and vocal-instrument harmony in existing lyrics-to-song generation models.

Technical Tags

song generation, lyrics-to-song, large language models (LLMs), audio language models, vocal-instrument harmony, dual-track tokens, decoder-only transformers, modular extension training, musicality, instruction following

Research Topics

Music Generation, AI Music Composition, Audio Synthesis, LLM Applications, Multimodal AI

Methods & Architectures

LeVo framework, LeLM (Language Model), Music Codec, Parallel Token Modeling (mixed & dual-track), Modular Extension Training, Large Language Models (LLMs), Decoder-only Transformers

Applications & Tasks

Music Production, Audio Synthesis, Creative AI, Entertainment, High-Quality Song Generation, Vocal-Instrument Harmony, Instruction Following in Music, Data Scarcity, Generating Songs from Lyrics, Creating High-Fidelity Music

Related Fields

Music Technology, Artificial Intelligence, Deep Learning, Audio Processing, Natural Language Processing

Keywords

Song Generation, Music AI, Audio Generation, LLMs, Lyrics-to-Song, Vocal Synthesis, Music Production, Deep Learning, Transformer Models, AI Music

Academic Context

#Music Generation, #AI Music Composition, #Audio Synthesis, #LLM Applications, #Multimodal AI

Commercial Potential

Potential Products

AI-powered music composition tools, Personalized song generation services, Royalty-free music generation platforms

Target Industries

Music Industry, Gaming, Film and Television, Advertising

Use Case Examples

Generating background music for videos, Creating custom songs based on user prompts, Assisting songwriters with melody and harmony generation

Competitive Edge

Reports significantly better objective and subjective results than existing open-source lyrics-to-song methods, with performance competitive with industry systems.

Market Opportunity

Growing interest in AI-generated music and audio content.

Revenue Models

SaaS for music creation, licensing of generated music, API access.

Resource Requirements

Compute Needs

Very High (for training large audio models)

Data Requirements

Large datasets of songs with lyrics and audio.

Deployment Constraints

Computational cost, latency for real-time generation.
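Latency for audio generation is often summarized as a real-time factor (RTF): generation time divided by the duration of the audio produced. The helper below is a minimal sketch of that arithmetic; the function name and the example numbers are hypothetical and are not measurements of LeVo.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """Return generation time divided by output audio duration.

    Values below 1.0 mean audio is produced faster than it plays back.
    """
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


# Hypothetical example: 30 s of compute for a 60 s clip.
rtf = real_time_factor(30.0, 60.0)
```

An RTF above 1.0 rules out streaming playback without buffering, which is one concrete way to read the latency constraint listed here.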

Scalability

Scales with model size and complexity of musical structures.

Production Readiness

Maturity Level

Research

Time to Market

3-5 years

Patent Potential

Moderate
