Abstract
We introduce Long-VITA, a simple yet effective large multi-modal model for
long-context visual-language understanding tasks. It concurrently processes
and analyzes image, video, and text modalities over 4K frames or 1M tokens
while delivering strong performance on short-context multi-modal tasks. We
propose an effective multi-modal training schema that starts with
large language models and proceeds through vision-language alignment, general
knowledge learning, and two sequential stages of long-sequence fine-tuning. We
further implement context-parallelism distributed inference and a
logits-masked language-modeling head to scale Long-VITA to infinitely long
inputs of images and text during model inference. Regarding training data,
Long-VITA is built
on a mix of 17M samples from public datasets only and demonstrates
state-of-the-art performance on various multi-modal benchmarks compared
against recent cutting-edge models trained with internal data. Long-VITA is
fully open-source and reproducible. By leveraging our inference designs,
Long-VITA models achieve a remarkable 2x prefill speedup and 4x context-length
extension on a single node with 8 GPUs. We hope Long-VITA can serve as a
competitive
baseline and offer valuable insights for the open-source community in advancing
long-context multi-modal understanding.
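
The abstract's logits-masked language-modeling head admits a short illustration. During long-context prefill, projecting every hidden state through the vocabulary matrix would materialize a [sequence x vocabulary] logits tensor, which dominates memory at 1M-token lengths; masking the head so that only the required positions are projected avoids this. The PyTorch code below is a minimal sketch of that idea under assumed names and shapes (`LogitsMaskedLMHead`, `positions` are illustrative), not the authors' implementation:

```python
import torch
import torch.nn as nn

class LogitsMaskedLMHead(nn.Module):
    """Sketch: project only the positions whose logits are needed,
    instead of the full sequence, through the vocabulary matrix."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size, bias=False)

    def forward(self, hidden: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # hidden:    [batch, seq_len, hidden_size] -- prefill activations
        # positions: [batch, k]                    -- indices whose logits we need
        batch_idx = torch.arange(hidden.size(0), device=hidden.device).unsqueeze(-1)
        selected = hidden[batch_idx, positions]  # [batch, k, hidden_size]
        return self.proj(selected)               # [batch, k, vocab_size]

# During prefill, only the final position's logits are needed to sample the
# first generated token, so the [seq_len, vocab_size] tensor is never built:
head = LogitsMaskedLMHead(hidden_size=1024, vocab_size=32000)
hidden = torch.randn(2, 4096, 1024)
logits = head(hidden, positions=torch.tensor([[4095], [4095]]))  # [2, 1, 32000]
```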
Authors (17)
Yunhang Shen
Chaoyou Fu
Shaoqi Dong
Xiong Wang
Yi-Fan Zhang
Peixian Chen
+11 more
Submitted
February 7, 2025
Key Contributions
Long-VITA is a large multi-modal model capable of processing extremely long contexts (up to 1M tokens or 4K frames) for visual-language tasks. It employs a novel multi-modal training schema and distributed inference techniques to achieve state-of-the-art performance on both long- and short-context tasks, using only public datasets.
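
The distributed inference techniques referenced here center on context parallelism: the input sequence is sharded across GPUs so that no single device must hold the activations for the full 1M-token context. The sketch below shows one common realization of this technique, in which queries stay sharded per rank while keys and values are all-gathered; it is an assumption about the general approach, not Long-VITA's actual implementation (causal masking with per-rank sequence offsets is omitted for brevity):

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def context_parallel_attention(q, k, v, group=None):
    """Attention over a sequence sharded across ranks.

    q, k, v: [batch, heads, local_seq, head_dim], where local_seq is this
    rank's contiguous slice of the full sequence. Requires a process group
    initialized via dist.init_process_group beforehand.
    """
    world = dist.get_world_size(group)
    k_parts = [torch.empty_like(k) for _ in range(world)]
    v_parts = [torch.empty_like(v) for _ in range(world)]
    dist.all_gather(k_parts, k.contiguous(), group=group)
    dist.all_gather(v_parts, v.contiguous(), group=group)
    k_full = torch.cat(k_parts, dim=2)  # [batch, heads, full_seq, head_dim]
    v_full = torch.cat(v_parts, dim=2)
    # Each rank attends its query shard over the full key/value sequence;
    # outputs stay sharded, so peak activation memory scales with local_seq.
    return F.scaled_dot_product_attention(q, k_full, v_full)
```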
Business Value
Enables deeper understanding and analysis of complex visual and textual information, opening doors for advanced content summarization, detailed visual search, and more sophisticated AI assistants.