GeneFace++: Generalized and Stable

Real-Time 3D Talking Face Generation

Anonymous Authors


Generating talking portraits of a person given arbitrary speech input is a crucial problem in the field of digital humans. A modern talking face generation system is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, neural radiance fields (NeRF) have become a popular rendering technique in this field, since they achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video. However, several challenges remain for NeRF-based methods: 1) regarding lip synchronization, it is hard to generate a long facial motion sequence with high lip accuracy and temporal consistency; 2) regarding video quality, due to the limited data used to train the renderer, it is vulnerable to out-of-domain motion conditions and occasionally produces poor rendering results; 3) regarding system efficiency, the slow training and inference speed of the vanilla NeRF severely obstructs its use in real-world applications.

In this paper, we propose GeneFace++ to handle these challenges by: 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method that regulates outliers in the predicted motion sequence to avoid robustness issues; 3) designing an instant motion-to-video renderer that achieves fast training and real-time inference. With these settings, GeneFace++ becomes the first NeRF-based method that achieves stable, nearly 2-times-real-time (45 FPS on an RTX 3090 Ti) talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in both subjective and objective evaluations.
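To make the landmark-regulation idea concrete, here is a minimal sketch of the locally linear embedding projection step: a predicted landmark frame is re-expressed as a barycentric combination of its nearest neighbors in the training-video landmark bank, which pulls out-of-domain predictions back toward the training manifold. The function name, neighbor count, and regularization constant are our assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def lle_project(pred, train_bank, k=10):
    """Project a predicted landmark frame onto the locally linear
    manifold spanned by its k nearest neighbors in the training set.

    pred:       (D,) flattened predicted 3D landmarks for one frame
    train_bank: (N, D) flattened landmarks from the training video
    k:          number of neighbors used for the reconstruction
    """
    # Find the k nearest training frames by Euclidean distance.
    dists = np.linalg.norm(train_bank - pred, axis=1)
    idx = np.argsort(dists)[:k]
    neighbors = train_bank[idx]                      # (k, D)

    # Solve for reconstruction weights w (summing to 1) that best
    # reconstruct `pred` from its neighbors (the standard LLE step).
    G = neighbors - pred                             # (k, D)
    C = G @ G.T                                      # local Gram matrix
    C += np.eye(k) * 1e-3 * np.trace(C)              # regularize for stability
    w = np.linalg.solve(C, np.ones(k))
    w /= w.sum()

    # The projected frame lies near the convex hull of its neighbors,
    # so outlier predictions are replaced by in-domain landmarks.
    return w @ neighbors
```

The projected sequence can then be fed to the renderer in place of the raw predictions, trading a small amount of expressiveness for robustness to out-of-domain motion.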

Overall Pipeline

The inference process of GeneFace++ is shown as follows:

Demo 1: GeneFace++ with 2 identities driven by 6 languages

We provide a demo video in which GeneFace++ models of 2 identities (May and Obama) are driven by audio clips in 6 languages (English, Chinese, French, German, Korean, and Japanese), demonstrating that we achieve the three goals of a modern talking face system:

  1. generalized lip synchronization (we generate lip-synced results for 6 languages);
  2. good video quality (we generate high-quality videos with rich identity-specific details and 3D consistency);
  3. high system efficiency (the 512x512-resolution video frames are generated by our instant motion-to-video model at roughly 2-times real-time speed: 45 FPS on an RTX 3090 and 60 FPS on an A100).

Demo 2: All methods driven by a 3-minute-long song

To further demonstrate the lip-sync generalizability of GeneFace++, the following video provides a hard case in which all methods are driven by a three-minute-long song.

Demo 3: Ablations on the Instant Motion-to-Video renderer

We visualize the following ablations: 1) whether the non-face regularization loss in the NeRF training stage addresses the temporal jittering problem in non-face areas; 2) whether the SR module improves the image fidelity of the NeRF renderer while preserving 3D consistency.

1. w./w.o. non-face reg loss

We can see that with the non-face reg loss, the rendered video is temporally more stable and realistic.
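As a rough illustration of what such a regularization can look like, the sketch below penalizes frame-to-frame pixel changes outside a face mask, encouraging the torso and background to stay temporally stable. This is our assumed formulation for illustration only; the paper's exact loss may differ.

```python
import numpy as np

def nonface_reg_loss(frames, face_mask):
    """Hypothetical non-face regularization: penalize frame-to-frame
    color changes outside the face region, so non-face areas (torso,
    background) do not jitter between rendered frames.

    frames:    (T, H, W, 3) rendered frames in [0, 1]
    face_mask: (H, W) boolean mask, True inside the face region
    """
    # Per-pixel absolute difference between consecutive frames.
    diff = np.abs(frames[1:] - frames[:-1])          # (T-1, H, W, 3)
    nonface = ~face_mask                             # penalize only outside the face
    # Mean temporal change over all non-face pixels and channels.
    return float(diff[:, nonface].mean())
```

Minimizing this term alongside the photometric loss leaves the mouth and expression free to move while suppressing the non-face jitter shown in the ablation.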

2. w./w.o. SR module

Click the image below to open an HTML page with a slider for comparing the two images:

Demo 4: Text-driven talking face generation

In the following video, we show the potential of GeneFace++ for text-driven talking face generation. We first use a zero-shot TTS model with Obama's voice as the prompt to generate an audio track from arbitrary text input, then use the synthesized audio track to drive GeneFace++ and obtain the video track. Both the left and right sub-videos are synthesized by GeneFace++, but with different head pose sequences.