
Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy

Abstract

We introduce Long-VITA, a simple yet effective large multi-modal model for long-context visual-language understanding tasks. It is adept at concurrently processing and analyzing image, video, and text modalities over 4K frames or 1M tokens while delivering advanced performance on short-context multi-modal tasks. We propose an effective multi-modal training schema that starts with large language models and proceeds through vision-language alignment, general knowledge learning, and two sequential stages of long-sequence fine-tuning. We further implement context-parallelism distributed inference and a logits-masked language modeling head to scale Long-VITA to infinitely long inputs of images and texts during model inference. Regarding training data, Long-VITA is built on a mix of 17M samples from public datasets only and demonstrates state-of-the-art performance on various multi-modal benchmarks, compared against recent cutting-edge models trained with internal data. Long-VITA is fully open-source and reproducible. By leveraging our inference designs, Long-VITA models achieve a remarkable 2x prefill speedup and 4x context length extension on a single node with 8 GPUs. We hope Long-VITA can serve as a competitive baseline and offer valuable insights for the open-source community in advancing long-context multi-modal understanding.
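The "logits-masked language modeling head" mentioned above plausibly refers to computing vocabulary logits only at the positions that actually need them, so that prefill memory does not scale with sequence length times vocabulary size. The paper does not spell out the implementation, so the sketch below is a minimal illustration of that general idea; the function name, shapes, and masking convention are assumptions, not the authors' code.

```python
import numpy as np

def masked_lm_head(hidden_states, lm_weight, position_mask):
    """Project only the unmasked positions through the vocabulary head.

    hidden_states: [seq_len, hidden_dim] final-layer activations
    lm_weight:     [hidden_dim, vocab_size] output projection
    position_mask: boolean [seq_len]; True where logits are needed

    Selecting positions *before* the expensive vocab projection means
    peak logits memory scales with position_mask.sum(), not seq_len.
    """
    selected = hidden_states[position_mask]   # [n_needed, hidden_dim]
    return selected @ lm_weight               # [n_needed, vocab_size]

# Toy example: during prefill of a long prompt, only the final token's
# logits are needed to sample the next token.
seq_len, hidden, vocab = 1024, 64, 32000
h = np.random.randn(seq_len, hidden).astype(np.float32)
W = np.random.randn(hidden, vocab).astype(np.float32)
mask = np.zeros(seq_len, dtype=bool)
mask[-1] = True  # keep only the last position

logits = masked_lm_head(h, W, mask)
print(logits.shape)  # (1, 32000)
```

With a 1M-token prompt and a ~32K vocabulary, materializing full logits in fp32 would cost over 100 GB, which is why restricting the projection to needed positions matters at this scale.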
Authors (17)
Yunhang Shen
Chaoyou Fu
Shaoqi Dong
Xiong Wang
Yi-Fan Zhang
Peixian Chen
+11 more
Submitted
February 7, 2025
arXiv Category
cs.CV
arXiv PDF

Key Contributions

Long-VITA is a large multimodal model capable of processing extremely long contexts (up to 1M tokens/4K frames) for visual-language tasks. It employs a novel multi-modal training schema and distributed inference techniques to achieve state-of-the-art performance on both long and short-context tasks, using only public datasets.
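The context-parallelism referred to here generally works by partitioning a long input sequence into contiguous shards, one per GPU, with attention computed cooperatively across devices. The paper's exact scheme is not detailed on this page, so the snippet below only sketches the partitioning step under assumed even, contiguous sharding; the helper name and ceiling-division layout are illustrative, not from the paper.

```python
def shard_sequence(token_ids, world_size):
    """Split a long token sequence into contiguous shards, one per device.

    In a full context-parallel setup each rank would run attention over
    its shard and exchange key/value blocks with peers; this helper shows
    only the partitioning. Uses ceiling division so the last rank may
    receive a shorter shard.
    """
    n = len(token_ids)
    per_rank = -(-n // world_size)  # ceil(n / world_size)
    return [token_ids[r * per_rank:(r + 1) * per_rank]
            for r in range(world_size)]

# Example: 10 tokens split across 4 ranks.
seq = list(range(10))
shards = shard_sequence(seq, 4)
print([len(s) for s in shards])  # [3, 3, 3, 1]
```

Contiguous sharding keeps each rank's tokens adjacent, which simplifies causal masking and rotary position offsets relative to interleaved layouts.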

Business Value

Enables deeper understanding and analysis of complex visual and textual information, opening doors for advanced content summarization, detailed visual search, and more sophisticated AI assistants.