Abstract
Advancing machine intelligence requires developing the ability to perceive
across multiple modalities, much as humans sense the world. We introduce
OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We
carefully study the design choices across model architecture and data curation.
For model architecture, we present three key innovations: (i) OmniAlignNet for
strengthening alignment between vision and audio embeddings in a shared
omni-modal latent space; (ii) Temporal Embedding Grouping for capturing
relative temporal alignment between vision and audio signals; and (iii)
Constrained Rotary Time Embedding for encoding absolute temporal information in
omni-modal embeddings. We introduce a curation and synthesis pipeline that
generates 24M single-modal and omni-modal conversations. We find that
modalities reinforce one another in both perception and reasoning. Our model,
OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal
understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while
using just 0.2T training tokens, a sixfold reduction compared to
Qwen2.5-Omni's 1.2T. Finally, we demonstrate omni-modal advantages in downstream
applications spanning robotics, medical AI, and smart factories.
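The abstract does not specify how OmniAlignNet is implemented. As a rough illustration of the stated idea, aligning vision and audio embeddings in a shared omni-modal latent space, the sketch below projects each modality into a common space and trains with a symmetric contrastive objective. The module names, dimensions, and the choice of a CLIP-style loss are assumptions for illustration, not the paper's method.

# Hypothetical sketch of vision-audio alignment in a shared latent space,
# in the spirit of the OmniAlignNet idea described above. Names, sizes,
# and the contrastive objective are assumptions, not the paper's design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OmniAlignSketch(nn.Module):
    def __init__(self, vision_dim=1024, audio_dim=768, latent_dim=512):
        super().__init__()
        # Project each modality into a shared omni-modal latent space.
        self.vision_proj = nn.Linear(vision_dim, latent_dim)
        self.audio_proj = nn.Linear(audio_dim, latent_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07)

    def forward(self, vision_emb, audio_emb):
        # L2-normalize so the dot product is cosine similarity.
        v = F.normalize(self.vision_proj(vision_emb), dim=-1)
        a = F.normalize(self.audio_proj(audio_emb), dim=-1)
        logits = self.logit_scale.exp() * v @ a.t()
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric contrastive loss: paired vision/audio clips attract,
        # unpaired ones repel, pulling both modalities into alignment.
        return (F.cross_entropy(logits, targets) +
                F.cross_entropy(logits.t(), targets)) / 2

# Usage with random per-clip embeddings (batch of 8 paired clips).
model = OmniAlignSketch()
loss = model(torch.randn(8, 1024), torch.randn(8, 768))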
Authors (32)
Hanrong Ye
Chao-Han Huck Yang
Arushi Goel
Wei Huang
Ligeng Zhu
Yuanhang Su
+26 more
Submitted
October 17, 2025
Key Contributions
Introduces OmniVinci, an open-source omni-modal LLM, with key architectural innovations (OmniAlignNet, Temporal Embedding Grouping, Constrained Rotary Time Embedding) and a large dataset (24M conversations). The work demonstrates that modalities reinforce each other, and the model outperforms existing models on cross-modal understanding tasks.
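For readers wondering what a "Constrained Rotary Time Embedding" might look like in code, the following is a speculative sketch: a standard rotary transform whose rotation angles come from absolute timestamps normalized into a fixed positional range. The normalization scheme, frequency schedule, and function name are assumptions for illustration only, not taken from the paper.

# Illustrative sketch of a rotary time embedding constrained to a bounded
# rotation range. The timestamp normalization and frequency schedule are
# assumptions, not the paper's implementation.
import torch

def constrained_rotary_time_embedding(x, timestamps, max_time, max_position=1024):
    """Rotate feature pairs of x by angles derived from absolute timestamps.

    x:          (num_tokens, dim) omni-modal token embeddings, dim even
    timestamps: (num_tokens,) absolute times in seconds
    max_time:   duration used to normalize timestamps into a bounded range
    """
    dim = x.size(-1)
    # Constrain absolute time to a fixed positional range so long videos
    # and short clips share the same rotation scale.
    positions = timestamps / max_time * max_position
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2).float() / dim))
    angles = positions[:, None] * inv_freq[None, :]          # (tokens, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Standard rotary transform applied per feature pair.
    rotated = torch.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Example: 4 tokens sampled at 0, 2, 5, and 9 seconds of a 10-second clip.
tokens = torch.randn(4, 64)
out = constrained_rotary_time_embedding(tokens, torch.tensor([0., 2., 5., 9.]), max_time=10.0)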
Business Value
Enables more human-like AI systems capable of understanding complex real-world scenarios involving sight, sound, and language, leading to advancements in areas like intelligent assistants, autonomous systems, and content analysis.