VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction

Abstract

Recent Multimodal Large Language Models (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance on both vision and speech tasks remains a significant challenge because of the fundamental differences between the modalities. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains an LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capability, but also enables efficient speech-to-speech dialogue without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model has both strong visual and speech capabilities, enabling near real-time vision and speech interaction. Code has been released at https://github.com/VITA-MLLM/VITA.
Authors (16)
Chaoyou Fu
Haojia Lin
Xiong Wang
Yi-Fan Zhang
Yunhang Shen
Xiaoyu Liu
+10 more
Submitted: January 3, 2025
arXiv Category: cs.CV

Key Contributions

VITA-1.5 is a model that aims for GPT-4o-level real-time vision and speech interaction. Its main contribution is a multi-stage training methodology that progressively trains an LLM to understand both visual and speech information. The resulting model preserves its vision-language capability and supports speech-to-speech dialogue without separate ASR and TTS modules, which the authors report significantly accelerates end-to-end multimodal response speed.
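
To make the idea of progressive, multi-stage training concrete, here is a minimal Python sketch of a stage schedule in which each stage updates only a chosen subset of model components and leaves the rest frozen. This is an illustration, not the authors' recipe: the summary above does not list the stages or the components, so every component name, stage name, and helper below (`COMPONENTS`, `SCHEDULE`, `trainable_flags`, `run_schedule`) is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

# Hypothetical component names, chosen for illustration only.
# The source does not specify the model's modules.
COMPONENTS = (
    "vision_encoder",
    "vision_adapter",
    "llm",
    "speech_encoder",
    "speech_adapter",
    "speech_decoder",
)


@dataclass(frozen=True)
class Stage:
    """One training stage: a name plus the components it updates."""
    name: str
    trainable: FrozenSet[str]


# Hypothetical schedule: each stage adds one capability while the
# components trained earlier stay frozen. This is an assumption about
# what "progressive" training could look like, not the paper's stages.
SCHEDULE = [
    Stage("vision_language", frozenset({"vision_adapter", "llm"})),
    Stage("speech_input", frozenset({"speech_adapter"})),
    Stage("speech_output", frozenset({"speech_decoder"})),
]


def trainable_flags(stage: Stage) -> Dict[str, bool]:
    """Map every component to True (updated) or False (frozen)."""
    unknown = stage.trainable - set(COMPONENTS)
    if unknown:
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return {name: name in stage.trainable for name in COMPONENTS}


def run_schedule(
    schedule: List[Stage],
    train_fn: Callable[[str, Dict[str, bool]], None],
) -> List[Tuple[str, List[str]]]:
    """Run the stages in order and record what each one trained."""
    history = []
    for stage in schedule:
        flags = trainable_flags(stage)
        train_fn(stage.name, flags)  # caller supplies the real training loop
        updated = sorted(name for name, on in flags.items() if on)
        history.append((stage.name, updated))
    return history


if __name__ == "__main__":
    def log_only(stage_name: str, flags: Dict[str, bool]) -> None:
        # Stand-in for a real training loop; it only reports the plan.
        frozen = [name for name, on in flags.items() if not on]
        print(f"{stage_name}: frozen = {frozen}")

    for name, updated in run_schedule(SCHEDULE, log_only):
        print(f"{name}: updated = {updated}")
```

In a real training setup, `train_fn` would apply the flags to the model (for example by freezing parameters) before running the optimizer; freezing earlier components is one common way to add a new modality without degrading what the model already learned.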

Business Value

By handling vision and speech in a single model with near real-time responses, this approach could make conversations with AI assistants and interfaces more natural and intuitive.