📄 Abstract
This work proposes an industry-level omni-modal large language model (LLM) pipeline that integrates auditory, visual, and linguistic modalities to overcome challenges such as limited tri-modal datasets, high computational costs, and complex feature alignment. The pipeline consists of three main components. First, a modular framework enables flexible configuration of various encoder-LLM-decoder architectures. Second, a lightweight training strategy pre-trains audio-language alignment on top of the state-of-the-art vision-language model Qwen2.5-VL, avoiding the costly pre-training of vision-specific modalities. Third, an audio synthesis pipeline generates high-quality audio-text data from diverse real-world scenarios, supporting applications such as Automatic Speech Recognition (ASR) and Speech-to-Speech chat. Building on this pipeline, we introduce an industry-level omni-modal LLM, Nexus. Extensive experiments validate the efficacy of our pipeline and yield the following key findings: (1) on visual understanding tasks, Nexus outperforms its backbone model, Qwen2.5-VL-7B, validating the efficiency of our training strategy; (2) on English Spoken Question-Answering, the model achieves higher accuracy than its same-period competitor, MiniCPM-o2.6-7B, on the LLaMA Q. benchmark; (3) on our real-world ASR test set, Nexus achieves outstanding performance, indicating its robustness in real scenarios; (4) on Speech-to-Text Translation, our model outperforms Qwen2-Audio-Instruct-7B; (5) on Text-to-Speech, using a pretrained vocoder (e.g., Fishspeech1.4 or CosyVoice2.0), Nexus is comparable to its backbone vocoder on the Seed-TTS benchmark; (6) an in-depth analysis of tri-modal alignment reveals that incorporating the audio modality enhances representational alignment between vision and language.
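To make the "modular encoder-LLM-decoder framework" idea concrete, the sketch below shows how such a pipeline could be composed from interchangeable modality encoders, a shared LLM backbone, and optional output decoders. All class and function names here are illustrative assumptions, not the actual Nexus API, and the toy components stand in for real encoders, the Qwen2.5-VL backbone, and a vocoder.

```python
# Minimal, hypothetical sketch of a modular encoder-LLM-decoder pipeline.
# Names and interfaces are assumptions for illustration, not Nexus's released code.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

Embedding = List[float]  # stand-in for a model hidden state


@dataclass
class OmniPipeline:
    """Composable pipeline: modality encoders -> shared LLM -> optional decoders."""
    encoders: Dict[str, Callable[[object], List[Embedding]]]          # e.g. {"audio": ..., "vision": ...}
    llm: Callable[[List[Embedding]], List[Embedding]]                 # backbone wrapper (e.g. around Qwen2.5-VL)
    decoders: Dict[str, Callable[[List[Embedding]], object]] = field(default_factory=dict)

    def run(self, inputs: Dict[str, object], output_modality: Optional[str] = None):
        # 1) Encode each provided modality into the LLM's embedding space.
        tokens: List[Embedding] = []
        for modality, raw in inputs.items():
            tokens.extend(self.encoders[modality](raw))
        # 2) Let the shared LLM fuse the multimodal token sequence.
        hidden = self.llm(tokens)
        # 3) Optionally decode back into a target modality (e.g. speech via a vocoder).
        if output_modality is None:
            return hidden
        return self.decoders[output_modality](hidden)


# Toy stand-ins so the sketch runs end to end.
audio_encoder = lambda wav: [[float(x)] for x in wav[:3]]
vision_encoder = lambda img: [[float(p)] for p in img[:3]]
llm_backbone = lambda toks: [[sum(t) for t in toks]]        # placeholder "fusion"
tts_decoder = lambda hidden: f"waveform({hidden})"          # placeholder vocoder call

pipeline = OmniPipeline(
    encoders={"audio": audio_encoder, "vision": vision_encoder},
    llm=llm_backbone,
    decoders={"speech": tts_decoder},
)
print(pipeline.run({"audio": [0.1, 0.2, 0.3], "vision": [255, 0, 0]}, output_modality="speech"))
```

The design choice illustrated here is that swapping an encoder, backbone, or decoder only changes a dictionary entry, which is what allows flexible configuration of different architectures.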
Authors (16)
Che Liu
Yingji Zhang
Dong Zhang
Weijie Zhang
Chenggong Gong
Yu Lu
+10 more
Submitted
February 26, 2025
Key Contributions
Proposes Nexus, an industry-level omni-modal LLM pipeline integrating language, audio, and vision. It features a modular framework, a lightweight training strategy leveraging Qwen2.5-VL for audio-language alignment, and an audio synthesis pipeline, overcoming data and computational challenges.
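The lightweight training strategy can be sketched as freezing a pretrained vision-language backbone and updating only a small audio adapter for audio-language alignment. The following is a toy, hypothetical illustration of that pattern (module names, dimensions, and the MSE alignment objective are placeholder assumptions, not the paper's released training code).

```python
# Hypothetical sketch: freeze a pretrained vision-language backbone, train only
# an audio adapter for audio-language alignment. All components here are toys.
import torch
import torch.nn as nn


class AudioAdapter(nn.Module):
    """Projects audio features into the (frozen) backbone's embedding space."""
    def __init__(self, audio_dim: int = 128, llm_dim: int = 256):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(audio_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        return self.proj(audio_feats)


# Stand-in for a pretrained vision-language backbone (e.g. Qwen2.5-VL-7B in the paper).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=1
)
for p in backbone.parameters():          # keep the costly vision-language weights fixed
    p.requires_grad = False

adapter = AudioAdapter()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)   # only adapter params are updated

audio_feats = torch.randn(2, 50, 128)    # (batch, frames, audio_dim) dummy audio features
target = torch.randn(2, 50, 256)         # dummy alignment target in the LLM embedding space

hidden = backbone(adapter(audio_feats))
loss = nn.functional.mse_loss(hidden, target)  # placeholder alignment loss
loss.backward()
optimizer.step()
print(f"toy alignment loss: {loss.item():.4f}")
```

Because only the adapter receives gradients, the expensive vision-language pre-training is reused rather than repeated, which is the source of the computational savings described above.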
Business Value
Enables the development of more sophisticated AI assistants and applications that can seamlessly understand and interact using language, audio, and visual information, improving user experience and functionality.