
MUG-V 10B: High-efficiency Training Pipeline for Large Video Generation Models


Abstract

In recent years, large-scale generative models for visual content (e.g., images, videos, and 3D objects/scenes) have made remarkable progress. However, training large-scale video generation models remains particularly challenging and resource-intensive due to cross-modal text-video alignment, the long sequences involved, and the complex spatiotemporal dependencies. To address these challenges, we present a training framework that optimizes four pillars: (i) data processing, (ii) model architecture, (iii) training strategy, and (iv) infrastructure for large-scale video generation models. These optimizations delivered significant efficiency gains and performance improvements across all stages of data preprocessing, video compression, parameter scaling, curriculum-based pretraining, and alignment-focused post-training. Our resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and, on e-commerce-oriented video generation tasks, surpasses leading open-source baselines in human evaluations. More importantly, we open-source the complete stack, including model weights, Megatron-Core-based large-scale training code, and inference pipelines for video generation and enhancement. To our knowledge, this is the first public release of large-scale video generation training code that exploits Megatron-Core to achieve high training efficiency and near-linear multi-node scaling. Details are available at https://github.com/Shopee-MUG/MUG-V.
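The abstract's claim of "near-linear multi-node scaling" can be made concrete with a standard scaling-efficiency calculation: per-node throughput at N nodes divided by per-node throughput at the smallest measured node count. The following is a minimal sketch of that calculation; the function name and the throughput figures are hypothetical illustrations, not measurements from the paper.

```python
def scaling_efficiency(throughputs):
    """Map node count -> efficiency relative to ideal linear scaling.

    `throughputs` maps a node count to measured samples/sec.
    Efficiency is per-node throughput divided by the per-node
    throughput of the smallest configuration (1.0 means perfectly linear).
    """
    base_nodes = min(throughputs)
    per_node_base = throughputs[base_nodes] / base_nodes
    return {
        nodes: (rate / nodes) / per_node_base
        for nodes, rate in sorted(throughputs.items())
    }


# Hypothetical throughput numbers (samples/sec), NOT taken from the paper.
measured = {1: 100.0, 2: 196.0, 4: 384.0, 8: 752.0}
efficiency = scaling_efficiency(measured)

for nodes, value in efficiency.items():
    print(f"{nodes} nodes: {value:.0%} of ideal linear scaling")
```

With these made-up numbers, efficiency stays at or above 94% up to 8 nodes, which is the kind of curve usually described as near-linear.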

Authors (9)

Yongshun Zhang

Zhongyi Fan

Yonghang Zhang

Zhangzikang Li

Weifeng Chen

Zhongwei Feng

+3 more

Submitted

October 20, 2025

arXiv Category

cs.CV


Key Contributions

This paper introduces a high-efficiency training framework for large-scale video generation models that optimizes data processing, model architecture, training strategy, and infrastructure. The resulting model, MUG-V 10B, matches recent state-of-the-art video generators overall and surpasses leading open-source baselines in human evaluations on e-commerce-oriented tasks. The authors open-source the complete stack: model weights, Megatron-Core-based training code, and inference pipelines.

Business Value

Enables more efficient and effective creation of high-quality video content, particularly for e-commerce, potentially reducing production costs and improving customer engagement through personalized video generation.

Paper Metadata

Innovation Type

Methodological

Deployment Feasibility

High, as the focus is on optimizing the training pipeline for efficiency, making large models more accessible to train and deploy.

Limitations Addressed

Addresses the challenges of resource-intensive training, cross-modal text-video alignment, long sequences, and complex spatiotemporal dependencies in large-scale video generation models.
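To illustrate why long sequences make video training expensive, and why video compression matters, the sketch below estimates how many tokens a transformer would see for one clip after latent compression and patchification. The function and all stride and patch values are assumptions chosen for illustration; they are not the configuration reported for MUG-V 10B.

```python
import math


def video_token_count(frames, height, width, t_stride, s_stride, patch):
    """Estimate the token sequence length for one video clip.

    Assumes a video autoencoder that downsamples time by `t_stride`
    and each spatial axis by `s_stride`, followed by square
    `patch` x `patch` patchification of the latent frames.
    """
    latent_t = math.ceil(frames / t_stride)
    latent_h = height // s_stride
    latent_w = width // s_stride
    return latent_t * (latent_h // patch) * (latent_w // patch)


# Hypothetical settings: a 5-second 720p clip at 24 fps, 8x compression
# along each axis, 2x2 patches. These are illustrative values only.
tokens = video_token_count(
    frames=120, height=720, width=1280, t_stride=8, s_stride=8, patch=2
)
attention_pairs = tokens ** 2  # full self-attention cost grows quadratically

print(f"tokens per clip: {tokens}")
print(f"attention pairs: {attention_pairs:.3e}")
```

Under these assumptions a single short clip already yields tens of thousands of tokens, so each additional factor of compression reduces the attention cost quadratically.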

Performance Gains

Significant efficiency gains and performance improvements across all training stages; matches recent state-of-the-art video generators overall; surpasses leading open-source baselines in human evaluations on e-commerce-oriented video generation tasks.

Technical Tags

video generation, large-scale training, data processing, model architecture, training strategy, infrastructure optimization, text-video alignment, spatiotemporal dependencies, curriculum learning, parameter scaling

Research Topics

Generative AI, Video Synthesis, Efficient Model Training, Large-Scale Deep Learning, Computer Vision

Methods & Architectures

Data preprocessing optimization, Video compression, Parameter scaling, Curriculum-based pretraining, Alignment-focused post-training, Large video generation models

Applications & Tasks

E-commerce, Content creation, Media production, Resource-intensive training, Cross-modal alignment, Long sequence modeling, Spatiotemporal dependency modeling, Video generation, E-commerce video generation

Related Fields

Computer Vision, Deep Learning, Natural Language Processing, Machine Learning Engineering

Keywords

video generation, large models, training efficiency, generative AI, deep learning, computer vision, e-commerce, data processing, model architecture, curriculum learning, spatiotemporal, alignment, infrastructure

Academic Context

#Generative AI, #Video Synthesis, #Efficient Model Training, #Large-Scale Deep Learning, #Computer Vision

Technology Stack

ML Infrastructure

High-efficiency training pipeline built on Megatron-Core

Data Processing Tools

Data preprocessing optimization, Video compression

Commercial Potential

Potential Products

Automated video generation platforms, Personalized marketing video tools

Target Industries

E-commerce, Advertising, Media and Entertainment, Marketing

Use Case Examples

Generating product demonstration videos, Creating personalized promotional content

Competitive Edge

Positions itself as a more efficient and performant solution for training large video generation models compared to existing methods, especially for specific applications like e-commerce.

Resource Requirements

Compute Needs

High (implied by 'large-scale' and 'resource-intensive')

Data Requirements

Large-scale datasets for video generation, potentially including text-video pairs.

Scalability

Focuses on scaling training efficiency for large models.

Production Readiness

Maturity Level

Research
