Abstract
In recent years, talking head generation has become a focal point for researchers. Considerable effort has gone into refining lip-sync motion, capturing expressive facial expressions, generating natural head poses, and achieving high-quality video. However, no single model has yet performed consistently well across all quantitative and qualitative metrics. We adopt Jamba, a hybrid Transformer-Mamba model, to animate a 3D face. Mamba, a pioneering Structured State Space Model (SSM) architecture, was developed to overcome the limitations of conventional Transformer architectures, particularly in handling long sequences, a challenge that has constrained traditional models. Jamba combines the advantages of both the Transformer and Mamba approaches, offering a comprehensive solution. Building on the foundational Jamba block, we present JambaTalk, which enhances motion variety and lip sync through multimodal integration. Extensive experiments show that our method achieves performance comparable to or better than state-of-the-art models.
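For intuition only, the sketch below shows how a hybrid block can interleave a Transformer attention sub-layer with a simplified state-space (Mamba-style) sub-layer. It is a minimal toy, not the authors' JambaTalk implementation: the layer sizes, the per-channel recurrence, and the interleaving pattern are assumptions made for illustration.

```python
# Minimal sketch of a hybrid Transformer-Mamba ("Jamba-style") block.
# Assumption-laden toy: real Mamba uses input-dependent (selective) SSM
# parameters and a hardware-aware scan; here the SSM is a fixed per-channel
# recurrence so the example stays self-contained and runnable on CPU.
import torch
import torch.nn as nn


class ToySSMLayer(nn.Module):
    """Heavily simplified SSM mixer: h_t = a * h_{t-1} + b_t, with b_t an
    input projection, followed by an output projection."""

    def __init__(self, d_model: int):
        super().__init__()
        self.a = nn.Parameter(torch.full((d_model,), 0.9))  # per-channel decay
        self.in_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        u = self.in_proj(x)
        h = torch.zeros_like(u[:, 0])
        outs = []
        for t in range(u.size(1)):      # sequential scan; O(L) in sequence length
            h = self.a * h + u[:, t]
            outs.append(h)
        return self.out_proj(torch.stack(outs, dim=1))


class HybridBlock(nn.Module):
    """One attention sub-block and one SSM sub-block, each with pre-norm and
    a residual connection, loosely mirroring how Jamba-style architectures
    interleave Transformer and Mamba layers."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ssm = ToySSMLayer(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.norm1(x)
        attn_out, _ = self.attn(q, q, q)
        x = x + attn_out                 # attention sub-block
        x = x + self.ssm(self.norm2(x))  # SSM sub-block
        return x


if __name__ == "__main__":
    block = HybridBlock(d_model=64, n_heads=4)
    features = torch.randn(2, 100, 64)   # e.g. per-frame multimodal features (assumed shape)
    print(block(features).shape)          # torch.Size([2, 100, 64])
```

The design choice illustrated here is the one the abstract highlights: attention layers capture content-dependent interactions across the sequence, while the SSM layers scale linearly with sequence length, so interleaving the two aims to keep quality on short-range structure without the quadratic cost of attention on long sequences.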