Research Paper for AI Researchers, Computer Animators, Game Developers, VR/AR Content Creators

OmniMotion-X: Versatile Multimodal Whole-Body Motion Generation

Abstract

This paper introduces OmniMotion-X, a versatile multimodal framework for whole-body human motion generation, leveraging an autoregressive diffusion transformer in a unified sequence-to-sequence manner. OmniMotion-X efficiently supports diverse multimodal tasks, including text-to-motion, music-to-dance, speech-to-gesture, and global spatial-temporal control scenarios (e.g., motion prediction, in-betweening, completion, and joint/trajectory-guided synthesis), as well as flexible combinations of these tasks. Specifically, we propose the use of reference motion as a novel conditioning signal, substantially enhancing the consistency of generated content, style, and temporal dynamics crucial for realistic animations. To handle multimodal conflicts, we introduce a progressive weak-to-strong mixed-condition training strategy. To enable high-quality multimodal training, we construct OmniMoCap-X, the largest unified multimodal motion dataset to date, integrating 28 publicly available MoCap sources across 10 distinct tasks, standardized to the SMPL-X format at 30 fps. To ensure detailed and consistent annotations, we render sequences into videos and use GPT-4o to automatically generate structured and hierarchical captions, capturing both low-level actions and high-level semantics. Extensive experimental evaluations confirm that OmniMotion-X significantly surpasses existing methods, demonstrating state-of-the-art performance across multiple multimodal tasks and enabling the interactive generation of realistic, coherent, and controllable long-duration motions.
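
The spatial-temporal control tasks named in the abstract (motion prediction, in-betweening, joint-guided synthesis) are often described as different choices of which frames and joints are given to the model as constraints. The sketch below illustrates that idea with a plain frames-by-joints mask. It is a hypothetical illustration, not the paper's implementation: the function name `control_mask`, the task labels, and the `known_frames` and `joints` parameters are all assumptions made for this example.

```python
def control_mask(num_frames, num_joints, task, known_frames=0, joints=()):
    """Build a frames x joints 0/1 mask; 1 marks values supplied as constraints.

    Hypothetical helper for illustration only (not from the paper).
    """
    mask = [[0] * num_joints for _ in range(num_frames)]
    k = min(known_frames, num_frames)
    if task == "prediction":
        # The first k frames are known; the rest must be generated.
        frames = range(k)
        cols = range(num_joints)
    elif task == "in_betweening":
        # The first and last k frames are known; the middle is generated.
        frames = sorted(set(range(k)) | set(range(num_frames - k, num_frames)))
        cols = range(num_joints)
    elif task == "joint_control":
        # Selected joints are constrained on every frame.
        frames = range(num_frames)
        cols = joints
    else:
        raise ValueError(f"unknown task: {task}")
    for f in frames:
        for j in cols:
            mask[f][j] = 1
    return mask


if __name__ == "__main__":
    for row in control_mask(6, 3, "in_betweening", known_frames=2):
        print(row)
```

Under this framing, combining tasks amounts to taking the element-wise maximum of two masks, which is one way to read the abstract's "flexible combinations of these tasks".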

Authors (8)

Guowei Xu

Yuxuan Bian

Ailing Zeng

Mingyi Shi

Shaoli Huang

Wen Li

+2 more

Submitted

October 22, 2025

arXiv Category

cs.CV

Key Contributions

OmniMotion-X is a versatile multimodal framework for whole-body human motion generation using an autoregressive diffusion transformer. It supports diverse tasks (text, music, speech to motion) and introduces reference motion conditioning for consistency. It also includes a new large multimodal motion dataset, OmniMoCap-X, and a mixed-condition training strategy.
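
The summary does not say how the progressive weak-to-strong mixed-condition training is scheduled. One possible reading, sketched below, is a staged curriculum that enables weakly constraining conditions first and adds strongly constraining ones later. Everything here is an assumption for illustration: the ordering in `CONDITIONS_WEAK_TO_STRONG`, the even split into stages, and the function name `active_conditions` are not taken from the paper.

```python
# Hypothetical ordering from weakest to strongest constraint (an assumption).
CONDITIONS_WEAK_TO_STRONG = ["text", "audio", "reference_motion", "joint_control"]


def active_conditions(step, total_steps, conditions=CONDITIONS_WEAK_TO_STRONG):
    """Return the conditions enabled at `step` under an evenly staged curriculum."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    n = len(conditions)
    # Clamp training progress to [0, 1].
    progress = min(max(step / total_steps, 0.0), 1.0)
    # Stage k (0-based) enables the first k + 1 conditions.
    stage = min(int(progress * n), n - 1)
    return list(conditions[: stage + 1])


if __name__ == "__main__":
    for step in (0, 25, 50, 75, 100):
        print(step, active_conditions(step, 100))
```

A real schedule would likely also randomize which of the enabled conditions are dropped per sample; this sketch only shows the staged ordering.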

Business Value

Could reduce the manual effort of creating realistic human animation for games, film, and VR/AR, and enable more dynamic, interactive virtual experiences.

Paper Metadata

Innovation Type

Algorithmic and Dataset

Deployment Feasibility

Moderate. Requires significant computational resources for training and inference, but the framework's versatility is a major advantage.

Limitations Addressed

Generating diverse and consistent whole-body human motion; Handling multiple modalities (text, audio, motion) simultaneously; Lack of large-scale, unified multimodal motion datasets; Ensuring temporal consistency and style

Technical Tags

human motion generation, multimodal AI, autoregressive diffusion transformer, sequence-to-sequence, text-to-motion, music-to-dance, speech-to-gesture, motion prediction, reference motion conditioning, dataset creation

Research Topics

Generative AI, Human Motion Synthesis, Multimodal Learning, Computer Animation, Deep Learning

Methods & Architectures

OmniMotion-X framework, Autoregressive Diffusion Transformer, Reference motion conditioning, Progressive weak-to-strong mixed-condition training, OmniMoCap-X dataset construction, Sequence-to-Sequence Models

Applications & Tasks

Computer Animation, Virtual Reality, Gaming, Robotics (motion planning), Human-Computer Interaction

Generating realistic whole-body human motion, Handling diverse multimodal inputs, Ensuring consistency in generated motion, Training on large multimodal motion datasets

Multimodal Motion Generation, Text-to-Motion Synthesis, Music-to-Dance Synthesis, Speech-to-Gesture Synthesis, Motion Prediction and Completion

Datasets & Benchmarks

Datasets

OmniMoCap-X

Related Fields

Generative Models, Computer Vision, Natural Language Processing, Animation, Robotics

Keywords

human motion generation, multimodal AI, diffusion models, transformers, animation, text-to-motion, music-to-dance, speech-to-gesture, motion synthesis, dataset

Academic Context

#Generative AI, #Human Motion Synthesis, #Multimodal Learning, #Computer Animation, #Deep Learning

Commercial Potential

Potential Products

AI-powered animation software, Tools for generating virtual characters' movements, Interactive virtual reality experiences

Target Industries

Gaming, Film and Animation, Virtual Reality, Augmented Reality, Robotics

Use Case Examples

Generating realistic character animations for video games based on dialogue or music; Creating virtual avatars that respond dynamically to user input; Synthesizing human-like motion for humanoid robots

Competitive Edge

Provides a unified and highly versatile framework for multimodal motion generation, surpassing previous methods in scope and consistency through novel conditioning and training strategies.

Market Opportunity

Large and growing market for animation, virtual content creation, and AI-driven character generation.

Revenue Models

Licensing of the motion generation engine, integration into animation software, SaaS for content creation.

Resource Requirements

Compute Needs

Very high, for training large diffusion transformer models on extensive multimodal data.

Data Requirements

Requires large-scale, diverse, and well-annotated multimodal motion capture datasets (like OmniMoCap-X).
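
The abstract notes that the OmniMoCap-X sources are standardized to the SMPL-X format at 30 fps. Bringing captures recorded at other frame rates to a common rate can be sketched as a resampling step. The example below is a minimal sketch under stated assumptions, not the dataset's actual pipeline: the function name `resample_to_fps` is made up here, and plain linear interpolation is only appropriate for positional channels such as root translation or joint positions. Rotation parameters would normally need spherical interpolation instead.

```python
def resample_to_fps(frames, src_fps, dst_fps=30.0):
    """Linearly resample a list of per-frame feature vectors to dst_fps.

    Hypothetical helper for illustration; suitable for positions, not rotations.
    """
    if src_fps <= 0 or dst_fps <= 0:
        raise ValueError("frame rates must be positive")
    if not frames:
        return []
    if len(frames) == 1:
        return [list(frames[0])]
    duration = (len(frames) - 1) / src_fps
    n_out = int(round(duration * dst_fps)) + 1
    out = []
    for i in range(n_out):
        # Time of output frame i, clamped to the clip duration.
        t = min(i / dst_fps, duration)
        pos = t * src_fps
        lo = min(int(pos), len(frames) - 2)
        w = pos - lo  # interpolation weight between frames lo and lo + 1
        out.append([(1 - w) * a + w * b for a, b in zip(frames[lo], frames[lo + 1])])
    return out


if __name__ == "__main__":
    clip_60fps = [[float(i)] for i in range(61)]  # one second captured at 60 fps
    print(len(resample_to_fps(clip_60fps, src_fps=60.0)))
```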

Deployment Constraints

Inference can be computationally intensive, requiring powerful hardware for real-time applications.

Scalability

The autoregressive design and transformer architecture allow long sequences to be handled, but computational cost grows with sequence length.
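
To make the scaling concern concrete: full self-attention over a sequence compares every token with every other token, so the number of query-key pairs grows quadratically with length. Generating in chunks with a bounded context keeps the growth roughly linear. The sketch below counts those pairs for both cases. It is back-of-the-envelope arithmetic, not a description of OmniMotion-X: the function names, the chunk and context sizes, and the one-token-per-frame assumption are all illustrative, and the count ignores the number of diffusion steps and model width.

```python
def attention_pairs(seq_len):
    """Query-key pairs for full self-attention over seq_len tokens."""
    return seq_len * seq_len


def windowed_autoregressive_pairs(total_len, chunk_len, context_len):
    """Pairs when generating chunk by chunk.

    Each chunk attends to itself plus at most context_len earlier tokens.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    pairs = 0
    generated = 0
    while generated < total_len:
        chunk = min(chunk_len, total_len - generated)
        context = min(context_len, generated)
        pairs += chunk * (chunk + context)
        generated += chunk
    return pairs


if __name__ == "__main__":
    frames = 900  # 30 seconds at 30 fps, assuming one token per frame
    print(attention_pairs(frames))
    print(windowed_autoregressive_pairs(frames, chunk_len=150, context_len=150))
```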

Production Readiness

Maturity Level

Research

Time to Market

2-4 years

Patent Potential

High, for the OmniMotion-X architecture, conditioning methods, and dataset.
