Abstract: Achieving disentangled control over multiple facial motions and accommodating
diverse input modalities greatly enhances the applicability and entertainment value of
talking head generation. This necessitates a deep exploration of the decoupled
space of facial features, ensuring that they a) operate independently without
mutual interference and b) can be preserved and shared across different input
modalities, two aspects often neglected in existing methods. To
address this gap, this paper proposes EDTalk++, a novel full disentanglement
framework for controllable talking head generation. Our framework enables
individual manipulation of mouth shape, head pose, eye movement, and emotional
expression, conditioned on video or audio inputs. Specifically, we employ four
lightweight modules to decompose the facial dynamics into four distinct latent
spaces representing mouth, pose, eye, and expression, respectively. Each space
is characterized by a set of learnable bases whose linear combinations define
specific motions. To ensure independence and accelerate training, we enforce
orthogonality among bases and devise an efficient training strategy to allocate
motion responsibilities to each space without relying on external knowledge.
The learned bases are then stored in corresponding banks, enabling shared
visual priors with audio input. Furthermore, considering the properties of each
space, we propose an Audio-to-Motion module for audio-driven talking head
synthesis. Experiments are conducted to demonstrate the effectiveness of
EDTalk++.
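
The following is a minimal illustrative sketch, not the authors' implementation, of the core idea described above: each motion space is a bank of learnable bases, a motion code is a linear combination of those bases, and an orthogonality penalty encourages the bases to stay mutually independent. Class and parameter names such as `MotionBank`, `num_bases`, and `latent_dim` are assumptions for illustration only.

```python
# Sketch of one latent motion space as a bank of learnable, near-orthogonal bases.
import torch
import torch.nn as nn


class MotionBank(nn.Module):
    def __init__(self, num_bases: int = 20, latent_dim: int = 512):
        super().__init__()
        # Learnable bases stored as a bank; each row is one basis vector.
        self.bases = nn.Parameter(torch.randn(num_bases, latent_dim) * 0.02)

    def forward(self, weights: torch.Tensor) -> torch.Tensor:
        # weights: (batch, num_bases) coefficients predicted from video or audio.
        # The motion code is the linear combination of the stored bases.
        return weights @ self.bases  # (batch, latent_dim)

    def orthogonality_loss(self) -> torch.Tensor:
        # Encourage mutual orthogonality of bases: ||B_hat B_hat^T - I||_F^2.
        normed = nn.functional.normalize(self.bases, dim=-1)
        gram = normed @ normed.t()
        eye = torch.eye(gram.size(0), device=gram.device)
        return ((gram - eye) ** 2).sum()


if __name__ == "__main__":
    # Four independent spaces for mouth, pose, eye, and expression (sizes are hypothetical).
    banks = {name: MotionBank() for name in ["mouth", "pose", "eye", "expression"]}
    weights = torch.softmax(torch.randn(2, 20), dim=-1)        # dummy coefficients
    mouth_code = banks["mouth"](weights)                        # (2, 512) motion code
    reg = sum(b.orthogonality_loss() for b in banks.values())   # add to the training loss
    print(mouth_code.shape, float(reg))
```

In this sketch, sharing the same banks between a video-driven and an audio-driven coefficient predictor is what would let the two modalities reuse identical visual priors, mirroring the shared-bank design the abstract describes.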