GeneFace++: Generalized and Stable

Real-Time 3D Talking Face Generation

Anonymous Authors

Abstract

Generating talking person portraits given arbitrary speech input is a crucial problem in the field of digital humans. A modern talking face generation system is expected to achieve the goals of generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, the neural radiance field (NeRF) has become a popular rendering technique in this field since it can achieve high-fidelity and 3D-consistent talking face generation with a few-minute-long training video. However, NeRF-based methods are challenged by non-generalized lip motion prediction, rendering that is not robust to out-of-domain (OOD) motion, and low inference efficiency. In this paper, we propose GeneFace++ to handle these challenges by 1) designing a generic audio-to-motion model that utilizes pitch and talking style information to improve temporal consistency and lip accuracy; 2) introducing a landmark locally linear embedding method to post-process the predicted motion sequence and alleviate visual artifacts; 3) proposing an instant motion-to-video renderer to achieve efficient training and real-time inference. With these settings, GeneFace++ becomes the first NeRF-based method that achieves stable and real-time talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in terms of subjective and objective evaluation.

Overall Pipeline

The inference process of GeneFace++ is illustrated in the figure below: the audio-to-motion model maps the input audio to a facial landmark sequence, the landmark LLE post-processing refines that sequence to alleviate visual artifacts, and the instant motion-to-video renderer turns it into video frames.
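For readers curious about how the landmark LLE post-processing step could be implemented, below is a minimal numpy sketch of the idea (the function name, neighbour count, and regularizer are our illustrative choices, not the released code): each predicted landmark frame is re-expressed as a weighted combination of its nearest neighbours from the training-set landmarks, which pulls out-of-domain predictions back toward the distribution seen during training.

```python
# Minimal sketch (not the authors' implementation) of landmark locally
# linear embedding (LLE) post-processing of a predicted motion sequence.
import numpy as np

def lle_project(pred, train_db, k=10, reg=1e-3, blend=1.0):
    """Project predicted landmark frames onto the local linear span of
    their nearest training-set neighbours.

    pred:     (T, D) predicted landmark frames (D = 3 * num_landmarks)
    train_db: (N, D) landmark frames extracted from the training video
    k:        number of nearest neighbours
    reg:      Tikhonov regularizer for the local Gram matrix
    blend:    1.0 = fully projected, 0.0 = keep the raw prediction
    """
    out = np.empty_like(pred)
    for t, x in enumerate(pred):
        # K nearest neighbours by Euclidean distance
        dists = np.linalg.norm(train_db - x, axis=1)
        idx = np.argsort(dists)[:k]
        neighbours = train_db[idx]                       # (k, D)
        # Solve for reconstruction weights w with sum(w) = 1
        diffs = neighbours - x                           # (k, D)
        gram = diffs @ diffs.T                           # (k, k)
        gram += (reg * np.trace(gram) + 1e-8) * np.eye(k)
        w = np.linalg.solve(gram, np.ones(k))
        w /= w.sum()
        projected = w @ neighbours                       # (D,)
        out[t] = blend * projected + (1.0 - blend) * x
    return out
```

In such a scheme, the blend factor trades off robustness (fully projected frames stay inside the training landmark distribution) against the expressiveness of the raw audio-to-motion prediction.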

Demo 1: GeneFace++ of 2 identities driven by 6 languages

We provide a demo video in which GeneFace++ models of 2 identities (May and Obama) are driven by audio clips in 6 languages (English, Chinese, French, German, Korean, and Japanese), showing that we achieve the three goals of a modern talking face system:

  1. generalized lip synchronization (we generate lip-synced results for all 6 languages);
  2. good video quality (we generate high-quality videos with rich identity-specific details and 3D consistency);
  3. high system efficiency (the 512x512-resolution video frames are generated by our instant motion-to-video model at about 2x real-time speed: 45 FPS on an RTX 3090 and 60 FPS on an A100).

Demo 2: All methods driven by a 3-minute-long song

To further demonstrate the lip-sync generalizability of GeneFace++, the following video shows a hard case in which all methods are driven by a three-minute-long song.

Demo 3: Ablations on the Instant Motion-to-Video renderer

We visualize the following ablations: 1) whether the non-face regularization loss in the NeRF training stage addresses the temporal jittering problem in the non-face area; 2) whether the SR module improves the image fidelity of the NeRF renderer while preserving 3D consistency.

1. w./w.o. non-face reg loss

We can see that with the non-face reg loss, the rendered result is more temporally stable and realistic.
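For context, one plausible form of such a non-face regularization term (our assumption for illustration; the exact formulation in the training code may differ) is a masked photometric penalty that keeps pixels outside the face region close to the ground-truth frame, which suppresses jitter in the torso and background:

```python
# A plausible non-face regularization loss: outside the face region, the
# rendered frame is pushed toward the ground-truth pixels. This is our
# assumed form, not necessarily the paper's exact loss.
import torch

def non_face_reg_loss(rendered, gt, face_mask):
    """
    rendered:  (B, 3, H, W) frames produced by the NeRF renderer
    gt:        (B, 3, H, W) ground-truth video frames
    face_mask: (B, 1, H, W) 1 inside the face region, 0 elsewhere
    """
    non_face = 1.0 - face_mask
    diff = (rendered - gt) ** 2 * non_face
    # Average only over non-face pixels so the scale is mask-independent
    return diff.sum() / (non_face.sum() * rendered.shape[1] + 1e-8)
```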

2. w./w.o. SR module

Click the image below to open an HTML page with a slider for comparing the two images:
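As a rough illustration of what an SR module on top of a NeRF renderer can look like (the architecture below is our assumption, not the released model), here is a lightweight PixelShuffle-based upsampling head with a bilinear residual path, so the network only has to predict the high-frequency details missing from the low-resolution NeRF output:

```python
# Sketch of a lightweight super-resolution head over a NeRF render.
# Architecture and hyper-parameters are illustrative assumptions.
import torch
import torch.nn as nn

class SRHead(nn.Module):
    def __init__(self, in_ch=3, mid_ch=64, scale=2):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.LeakyReLU(0.2, True),
            nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.LeakyReLU(0.2, True),
        )
        # PixelShuffle rearranges (C*scale^2, H, W) -> (C, H*scale, W*scale)
        self.up = nn.Sequential(
            nn.Conv2d(mid_ch, in_ch * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
        )

    def forward(self, x):
        # Residual connection: predict details on top of a bilinearly
        # upsampled copy of the low-resolution NeRF render.
        base = nn.functional.interpolate(
            x, scale_factor=self.scale, mode='bilinear', align_corners=False)
        return base + self.up(self.body(x))

# e.g. SRHead(scale=2)(torch.rand(1, 3, 256, 256)) -> shape (1, 3, 512, 512)
```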

Demo 4: Text-driven talking face generation

In the following video, we show the potential of GeneFace++ for text-driven talking face generation. We first use a zero-shot TTS model, with Obama's voice as the prompt, to generate the audio track from arbitrary text inputs, and then use the synthesized audio to drive GeneFace++ and obtain the video track. Both the left and right sub-videos are synthesized by GeneFace++, but with different head pose sequences.
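A schematic view of this text-driven pipeline is sketched below; the tts_fn and genefacepp_fn callables are hypothetical placeholders (not actual APIs of the TTS model or the GeneFace++ code base) that stand in for the two stages described above:

```python
# Schematic sketch of the text-driven pipeline. tts_fn and genefacepp_fn
# are hypothetical placeholders: tts_fn maps (text, speaker prompt) to a
# waveform, genefacepp_fn maps (waveform, head poses) to video frames.
def text_to_talking_face(text, voice_prompt_wav, tts_fn, genefacepp_fn,
                         pose_sequence=None):
    # 1) Zero-shot TTS: clone the target timbre from a short reference clip
    #    and synthesize speech for the arbitrary input text.
    audio = tts_fn(text, speaker_prompt=voice_prompt_wav)
    # 2) GeneFace++: audio-to-motion prediction, landmark LLE
    #    post-processing, then instant motion-to-video rendering. An
    #    optional head pose sequence changes the head trajectory while
    #    keeping the same lip motion.
    video = genefacepp_fn(audio, head_poses=pose_sequence)
    return video
```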