📄 Abstract
Recent Multimodal Large Language Models (MLLMs) have typically focused on
integrating visual and textual modalities, with less emphasis placed on the
role of speech in enhancing interaction. However, speech plays a crucial role
in multimodal dialogue systems, and achieving high performance on both
vision and speech tasks remains a significant challenge due to fundamental
differences between the modalities. In this paper, we propose a carefully
designed multi-stage training methodology that progressively trains the LLM to
understand both visual and speech information, ultimately enabling fluent
vision and speech interaction. Our approach not only preserves strong
vision-language capability, but also enables efficient speech-to-speech
dialogue without separate ASR and TTS modules, significantly accelerating
end-to-end multimodal response speed. By comparing our method against
state-of-the-art counterparts across image, video, and speech benchmarks, we
demonstrate that our model possesses strong visual and speech capabilities,
enabling near real-time vision and speech interaction. Code has been released
at https://github.com/VITA-MLLM/VITA.
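To make the multi-stage idea concrete, here is a minimal sketch of progressive training in which different sub-modules are unfrozen at each stage. The module names (vision_encoder, speech_encoder, speech_decoder), the stage ordering, and the freezing choices are illustrative assumptions for exposition, not VITA-1.5's exact recipe; see the released code for the actual implementation.

```python
# Hypothetical sketch of progressive multi-stage training for a vision+speech MLLM.
# Stage names, module names, and freezing choices are assumptions, not the paper's exact setup.
import torch.nn as nn


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze all parameters of a sub-module."""
    for p in module.parameters():
        p.requires_grad = trainable


def run_stage(name, model, trainable_parts, dataloader, train_step):
    """Train only the listed sub-modules; everything else stays frozen."""
    for part in model.children():
        set_trainable(part, part in trainable_parts)
    for batch in dataloader:
        train_step(model, batch)
    print(f"finished stage: {name}")


class VisionSpeechLLM(nn.Module):
    """Toy container: an LLM backbone plus vision/speech encoders and a speech decoder."""
    def __init__(self, llm, vision_encoder, speech_encoder, speech_decoder):
        super().__init__()
        self.llm = llm
        self.vision_encoder = vision_encoder
        self.speech_encoder = speech_encoder
        self.speech_decoder = speech_decoder


def progressive_training(model, loaders, train_step):
    # Stage 1: align visual features with the LLM on vision-language data.
    run_stage("vision-language", model,
              {model.vision_encoder, model.llm}, loaders["vision_text"], train_step)
    # Stage 2: teach the model to understand speech input.
    run_stage("speech-understanding", model,
              {model.speech_encoder}, loaders["speech_text"], train_step)
    # Stage 3: train an end-to-end speech decoder so responses are produced
    # directly as speech, without separate ASR and TTS modules.
    run_stage("speech-generation", model,
              {model.speech_decoder}, loaders["speech_dialogue"], train_step)
```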
Authors (16)
Chaoyou Fu
Haojia Lin
Xiong Wang
Yi-Fan Zhang
Yunhang Shen
Xiaoyu Liu
+10 more
Submitted
January 3, 2025
Key Contributions
Introduces VITA-1.5, a framework aiming for GPT-4o-level real-time vision and speech interaction. It proposes a multi-stage training methodology that teaches the LLM to understand both visual and speech information, enabling fluent interaction and efficient speech-to-speech dialogue without separate ASR/TTS modules, and substantially improving end-to-end response speed.
Business Value
Revolutionizes human-computer interaction by enabling natural, real-time conversations that seamlessly integrate vision and speech, leading to more intuitive and powerful AI assistants and interfaces.