Abstract
Advancing machine intelligence requires developing the ability to perceive
across multiple modalities, much as humans sense the world. We introduce
OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We
carefully study the design choices across model architecture and data curation.
For model architecture, we present three key innovations: (i) OmniAlignNet for
strengthening alignment between vision and audio embeddings in a shared
omni-modal latent space; (ii) Temporal Embedding Grouping for capturing
relative temporal alignment between vision and audio signals; and (iii)
Constrained Rotary Time Embedding for encoding absolute temporal information in
omni-modal embeddings. We introduce a curation and synthesis pipeline that
generates 24M single-modal and omni-modal conversations. We find that
modalities reinforce one another in both perception and reasoning. Our model,
OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal
understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while
using just 0.2T training tokens, a sixfold reduction compared to
Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream
applications spanning robotics, medical AI, and smart factories.
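The abstract does not specify how OmniAlignNet is implemented. As a rough illustration of the stated idea, aligning vision and audio embeddings in a shared omni-modal latent space, the sketch below projects each modality into a common space and trains with a symmetric contrastive objective. The module names, dimensions, and the choice of a CLIP-style loss are assumptions for illustration, not the paper's method.

# Hypothetical sketch of vision-audio alignment in a shared latent space,
# in the spirit of the OmniAlignNet idea described above. Names, sizes,
# and the contrastive objective are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAlignSketch(nn.Module):
    def __init__(self, vision_dim=1024, audio_dim=768, latent_dim=512):
        super().__init__()
        # Project each modality into a shared omni-modal latent space.
        self.vision_proj = nn.Linear(vision_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, vision_emb, audio_emb):
        # L2-normalize so the dot product is cosine similarity.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric contrastive loss: paired vision/audio clips attract,
        # unpaired ones repel, pulling both modalities into alignment.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Usage with random per-clip embeddings (batch of 8 paired clips).
model = OmniAlignSketch()
loss = model(torch.randn(8, 1024), torch.randn(8, 768))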
Authors (32)
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
Yuanhang Su
+26 more
Submitted
October 17, 2025
Key Contributions
Introduces OmniVinci, an open-source omni-modal LLM, with key architectural innovations (OmniAlignNet, Temporal Embedding Grouping, Constrained Rotary Time Embedding) and a large dataset (24M conversations). The work demonstrates that modalities reinforce each other, and the model outperforms existing models on cross-modal understanding tasks.
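For readers wondering what a "Constrained Rotary Time Embedding" might look like in code, the following is a speculative sketch: a standard rotary transform whose rotation angles come from absolute timestamps normalized into a fixed positional range. The normalization scheme, frequency schedule, and function name are assumptions for illustration only, not taken from the paper.

# Illustrative sketch of a rotary time embedding constrained to a bounded
# rotation range. The timestamp normalization and frequency schedule are
# assumptions, not the paper's implementation.
import torch

def constrained_rotary_time_embedding(x, timestamps, max_time, max_position=1024):
    """Rotate feature pairs of x by angles derived from absolute timestamps.

    x:          (num_tokens, dim) omni-modal token embeddings, dim even
    timestamps: (num_tokens,) absolute times in seconds
    max_time:   duration used to normalize timestamps into a bounded range
    """
    dim = x.size(-1)
    # Constrain absolute time to a fixed positional range so long videos
    # and short clips share the same rotation scale.
    positions = timestamps / max_time * max_position
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None] * inv_freq[None, :]          # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard rotary transform applied per feature pair.
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Example: 4 tokens sampled at 0, 2, 5, and 9 seconds of a 10-second clip.
tokens = torch.randn(4, 64)
out = constrained_rotary_time_embedding(tokens, torch.tensor([0., 2., 5., 9.]), max_time=10.0)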
Business Value
Enables more human-like AI systems capable of understanding complex real-world scenarios involving sight, sound, and language, leading to advancements in areas like intelligent assistants, autonomous systems, and content analysis.