Generating talking person portraits from arbitrary speech input is a crucial problem in the field of digital humans. A modern talking face generation system is expected to achieve generalized audio-lip synchronization, good video quality, and high system efficiency. Recently, neural radiance fields (NeRF) have become a popular rendering technique in this field, since they can achieve high-fidelity and 3D-consistent talking face generation from a few-minute-long training video. However, several challenges remain for NeRF-based methods: 1) as for lip synchronization, it is hard to generate a long facial motion sequence with high lip accuracy and temporal consistency; 2) as for video quality, due to the limited data used to train the renderer, it is vulnerable to out-of-domain motion conditions and occasionally produces bad rendering results; 3) as for system efficiency, the slow training and inference speed of vanilla NeRF severely obstructs its use in real-world applications.
In this paper, we propose GeneFace++ to handle these challenges by: 1) utilizing the pitch contour as an auxiliary feature and introducing a temporal loss in the facial motion prediction process; 2) proposing a landmark locally linear embedding method that regulates outliers in the predicted motion sequence to avoid robustness issues; 3) designing an instant motion-to-video renderer that achieves fast training and real-time inference. With these settings, GeneFace++ becomes the first NeRF-based method to achieve stable and nearly 2x real-time (45 FPS on an RTX 3090 Ti) talking face generation with generalized audio-lip synchronization. Extensive experiments show that our method outperforms state-of-the-art baselines in both subjective and objective evaluations.
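To make the landmark regulation step (point 2 above) concrete, below is a minimal sketch of a locally-linear-embedding projection for one predicted landmark frame. It assumes flattened landmark vectors and a database of landmarks collected from the training video; the function name, neighbor count, and regularization constant are illustrative choices, not the exact implementation used in GeneFace++.

```python
import numpy as np

def lle_project(pred, database, k=10, reg=1e-4):
    """Project a predicted landmark frame onto the locally linear
    manifold spanned by its k nearest neighbors in the database.

    pred:     (D,) flattened landmark vector for one frame
    database: (N, D) landmark vectors collected from the training video
    """
    # 1) find the k nearest neighbors of the prediction
    d2 = np.sum((database - pred) ** 2, axis=1)
    idx = np.argsort(d2)[:k]
    neighbors = database[idx]                       # (k, D)

    # 2) solve for reconstruction weights w with sum(w) = 1
    #    (standard LLE local weights: minimize ||pred - w @ neighbors||^2)
    G = (neighbors - pred) @ (neighbors - pred).T   # (k, k) local Gram matrix
    G += reg * np.trace(G) * np.eye(k)              # regularize for numerical stability
    w = np.linalg.solve(G, np.ones(k))
    w /= w.sum()

    # 3) the projected landmark is the weighted combination of neighbors
    return w @ neighbors
```

At inference time, a predicted frame that drifts outside the training distribution is pulled back onto the locally linear manifold of ground-truth landmarks (in practice the prediction could also be blended with its projection), which is what suppresses the out-of-domain rendering failures described above.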
We provide a demo video in which GeneFace++ models of two identities (May and Obama) are driven by audio clips in six languages (English, Chinese, French, German, Korean, and Japanese), to show that we achieve the three goals of a modern talking face system:
To further demonstrate the lip-sync generalizability of GeneFace++, we provide a hard case in the following video, in which all methods are driven by a three-minute-long song.
1. w./w.o. non-face reg loss
We can see that with the non-face regularization loss, the rendered head is more temporally stable and realistic.
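The exact form of this loss is not spelled out here, but a plausible reading is a masked reconstruction penalty that keeps pixels outside the parsed face region close to the ground-truth frame, so that head motion does not perturb the torso and background. The sketch below is an assumption along those lines; the function name, masked-MSE form, and mask source are all illustrative.

```python
import torch

def non_face_reg_loss(rendered, target, face_mask):
    """Regularize pixels outside the parsed face region toward the
    ground-truth frame, stabilizing the torso/background across frames.

    rendered:  (B, 3, H, W) renderer output
    target:    (B, 3, H, W) ground-truth video frame
    face_mask: (B, 1, H, W) 1 inside the face region, 0 elsewhere
    """
    non_face = 1.0 - face_mask
    # mean squared error restricted to non-face pixels
    diff2 = (rendered - target) ** 2 * non_face
    return diff2.sum() / (non_face.sum() * rendered.shape[1] + 1e-8)
```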
Click the image below to open an HTML page with a slider that compares the two images: