📄 Abstract
Recent advances in large language models (LLMs) and audio language models
have significantly improved music generation, particularly in lyrics-to-song
generation. However, existing approaches still struggle with the complex
composition of songs and the scarcity of high-quality data, leading to
limitations in audio quality, musicality, instruction following, and
vocal-instrument harmony. To address these challenges, we introduce LeVo, a
language-model-based framework consisting of LeLM and Music Codec. LeLM is
capable of parallel modeling of two types of tokens: mixed tokens, which
represent the combined audio of vocals and accompaniment to achieve better
vocal-instrument harmony, and dual-track tokens, which separately encode vocals
and accompaniment for high-quality song generation. It employs two decoder-only
transformers and a modular extension training strategy to prevent interference
between different token types. To further enhance musicality and
instruction-following ability, we introduce a multi-preference alignment method based on
Direct Preference Optimization (DPO). This method handles diverse human
preferences through a semi-automatic data construction process and
post-training. Experimental results demonstrate that LeVo significantly
outperforms existing open-source methods in both objective and subjective
metrics, while performing competitively with industry systems. Ablation studies
further justify the effectiveness of our designs. Audio examples and source
code are available at https://levo-demo.github.io and
https://github.com/tencent-ailab/songgeneration.
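As a point of reference for the alignment step mentioned in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not LeVo's multi-preference method, whose details the abstract does not give. The function name `dpo_loss`, its parameter names, and the default `beta=0.1` are illustrative assumptions; the inputs are assumed to be summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") output under the trained policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    All names and the default beta are assumptions for illustration,
    not taken from the LeVo paper or its code.
    """
    # How much more the policy favors each output than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between chosen and rejected outputs.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage: the loss drops below log(2) once the policy favors the chosen
# output more strongly than the reference model does.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
print(baseline, improved)
```

When the policy and the reference agree, the margin is zero and the loss equals log 2; training lowers the loss by raising the chosen output's likelihood relative to the rejected one, measured against the reference model.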
Authors (13)
Shun Lei
Yaoxun Xu
Zhiwei Lin
Huaicheng Zhang
Wei Tan
Hangting Chen
+7 more
Key Contributions
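Introduces LeVo, a song generation framework consisting of LeLM and a Music Codec. LeLM models two token types in parallel: mixed tokens, which capture the combined vocal and accompaniment audio for better vocal-instrument harmony, and dual-track tokens, which encode vocals and accompaniment separately for higher audio quality. A modular extension training strategy prevents interference between the token types, and a DPO-based multi-preference alignment method improves musicality and instruction following.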
Business Value
Democratizes music creation by providing tools for generating high-quality songs, potentially lowering production costs and enabling new forms of artistic expression.