📄 Abstract
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignment. The pipeline consists of three main components. First, a modular framework enables flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy pre-trains audio-language alignment on top of the state-of-the-art vision-language model Qwen2.5-VL, avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition (ASR) and Speech-to-Speech chat. Building on this pipeline, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline and yield the following key findings: (1) on visual understanding tasks, Nexus outperforms its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy; (2) on English Spoken Question-Answering, the model achieves higher accuracy than its same-period competitor, MiniCPM-o2.6-7B, on the LLaMA Q. benchmark; (3) on our real-world ASR test set, Nexus achieves outstanding performance, indicating its robustness in real scenarios; (4) on Speech-to-Text Translation, our model outperforms Qwen2-Audio-Instruct-7B; (5) on Text-to-Speech, using a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark; (6) an in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
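To make the "modular encoder-LLM-decoder framework" idea concrete, the sketch below shows how such a pipeline could be composed from interchangeable modality encoders, a shared LLM backbone, and optional output decoders. All class and function names here are illustrative assumptions, not the actual Nexus API, and the toy components stand in for real encoders, the Qwen2.5-VL backbone, and a vocoder.

```python
# Minimal, hypothetical sketch of a modular encoder-LLM-decoder pipeline.
# Names and interfaces are assumptions for illustration, not Nexus's released code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Embedding = List[float]  # stand-in for a model hidden state


@dataclass
class OmniPipeline:
    """Composable pipeline: modality encoders -> shared LLM -> optional decoders."""
    encoders: Dict[str, Callable[[object], List[Embedding]]]          # e.g. {"audio": ..., "vision": ...}
    llm: Callable[[List[Embedding]], List[Embedding]]                 # backbone wrapper (e.g. around Qwen2.5-VL)
    decoders: Dict[str, Callable[[List[Embedding]], object]] = field(default_factory=dict)

    def run(self, inputs: Dict[str, object], output_modality: Optional[str] = None):
        # 1) Encode each provided modality into the LLM's embedding space.
        tokens: List[Embedding] = []
        for modality, raw in inputs.items():
            tokens.extend(self.encoders[modality](raw))
        # 2) Let the shared LLM fuse the multimodal token sequence.
        hidden = self.llm(tokens)
        # 3) Optionally decode back into a target modality (e.g. speech via a vocoder).
        if output_modality is None:
            return hidden
        return self.decoders[output_modality](hidden)


# Toy stand-ins so the sketch runs end to end.
audio_encoder = lambda wav: [[float(x)] for x in wav[:3]]
vision_encoder = lambda img: [[float(p)] for p in img[:3]]
llm_backbone = lambda toks: [[sum(t) for t in toks]]        # placeholder "fusion"
tts_decoder = lambda hidden: f"waveform({hidden})"          # placeholder vocoder call

pipeline = OmniPipeline(
    encoders={"audio": audio_encoder, "vision": vision_encoder},
    llm=llm_backbone,
    decoders={"speech": tts_decoder},
)
print(pipeline.run({"audio": [0.1, 0.2, 0.3], "vision": [255, 0, 0]}, output_modality="speech"))
```

The design choice illustrated here is that swapping an encoder, backbone, or decoder only changes a dictionary entry, which is what allows flexible configuration of different architectures.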
Authors (16)
Che Liu
Yingji Zhang
Dong Zhang
Weijie Zhang
Chenggong Gong
Yu Lu
+10 more
Submitted
February 26, 2025
Key Contributions
Proposes Nexus, an industry-level omni-modal LLM pipeline integrating language, audio, and vision. It features a modular framework, a lightweight training strategy leveraging Qwen2.5-VL for audio-language alignment, and an audio synthesis pipeline, overcoming data and computational challenges.
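The lightweight training strategy can be sketched as freezing a pretrained vision-language backbone and updating only a small audio adapter for audio-language alignment. The following is a toy, hypothetical illustration of that pattern (module names, dimensions, and the MSE alignment objective are placeholder assumptions, not the paper's released training code).

```python
# Hypothetical sketch: freeze a pretrained vision-language backbone, train only
# an audio adapter for audio-language alignment. All components here are toys.
import torch
import torch.nn as nn


class AudioAdapter(nn.Module):
    """Projects audio features into the (frozen) backbone's embedding space."""
    def __init__(self, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)


# Stand-in for a pretrained vision-language backbone (e.g. Qwen2.5-VL-7B in the paper).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=1
)
for p in backbone.parameters():          # keep the costly vision-language weights fixed
    p.requires_grad = False

adapter = AudioAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)   # only adapter params are updated

audio_feats = torch.randn(2, 50, 128)    # (batch, frames, audio_dim) dummy audio features
target = torch.randn(2, 50, 256)         # dummy alignment target in the LLM embedding space

hidden = backbone(adapter(audio_feats))
loss = nn.functional.mse_loss(hidden, target)  # placeholder alignment loss
loss.backward()
optimizer.step()
print(f"toy alignment loss: {loss.item():.4f}")
```

Because only the adapter receives gradients, the expensive vision-language pre-training is reused rather than repeated, which is the source of the computational savings described above.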
Business Value
Enables the development of more sophisticated AI assistants and applications that can seamlessly understand and interact using language, audio, and visual information, improving user experience and functionality.