VOODOO XP: Expressive One-Shot Head Reenactment for VR Telepresence

SIGGRAPH Asia 2024 (Journal Track)

ACM Transactions on Graphics

We introduce VOODOO XP, a real-time, 3D-aware one-shot head reenactment method that generates expressive facial animations from any driver video and a single 2D portrait. Our solution enables an end-to-end VR telepresence system.

Abstract

We introduce VOODOO XP: a 3D-aware one-shot head reenactment method that can generate highly expressive facial expressions from any input driver video and a single 2D portrait. Our solution is real-time, view-consistent, and can be instantly used without calibration or fine-tuning. We demonstrate our solution in a monocular video setting and an end-to-end VR telepresence system for two-way communication. Compared to 2D head reenactment methods, 3D-aware approaches aim to preserve the identity of the subject and ensure view-consistent facial geometry for novel camera poses, which makes them suitable for immersive applications. While various facial disentanglement techniques have been introduced, cutting-edge 3D-aware neural reenactment techniques still lack expressiveness and fail to reproduce complex and fine-scale facial expressions. We present a novel cross-reenactment architecture that directly transfers the driver's facial expressions to transformer blocks of the input source's 3D lifting module. We show that highly effective disentanglement is possible using an innovative multi-stage self-supervision approach, which combines a coarse-to-fine strategy with explicit face neutralization and 3D lifted frontalization during the initial training stage. We further integrate our novel head reenactment solution into an accessible high-fidelity VR telepresence system, where any person can instantly build a personalized neural head avatar from any photo and bring it to life using the headset. We demonstrate state-of-the-art performance in terms of expressiveness and likeness preservation on a large set of diverse subjects and capture conditions.

One-Shot 3D Head Reenactment

Telepresence VR System

Optional Few-shot Fine-tuning

Comparisons with SOTA

Methodology

Overview. We introduce a novel 3D-aware neural head reenactment architecture using a transformer-based expression transfer approach for generating highly expressive facial expressions in real-time. For effective volumetric disentanglement of complex expressions, we propose a multi-stage training approach based on face neutralization and 3D lifted frontalization, coarse-to-fine training, and global fine-tuning. Finally, we present the first end-to-end VR telepresence solution based on a one-shot 3D head reenactment algorithm.

Architecture
3D Lifting and Rendering. We use the state-of-the-art Lp3D model [Trevithick et al. 2023] for 3D lifting, which transforms a 2D input image into a 3D neural radiance field. The model architecture is shown in the lower part of the above figure. To enhance expression and identity generalization, we train this module on a combination of the NeRSemble dataset, which contains millions of diverse expressions, and a synthetic dataset generated by DiffPortrait3D, a diffusion-based 3D lifting model.
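To make the two-branch design concrete, here is a minimal PyTorch sketch of an Lp3D-style lifting module. All layer sizes, depths, and names are hypothetical simplifications of the architecture described by Trevithick et al. [2023], not the paper's exact configuration.

import torch
import torch.nn as nn

class Lp3DLiftingSketch(nn.Module):
    """Hypothetical two-branch lifting: portrait tokens -> triplane features."""
    def __init__(self, dim=256, planes=3, plane_ch=32):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        # First branch: global, low-level features (head geometry, expression).
        self.low_level = nn.TransformerEncoder(layer(), num_layers=4)
        # Second branch: high-frequency details (skin texture, hair strands).
        self.high_freq = nn.TransformerEncoder(layer(), num_layers=4)
        self.to_triplane = nn.Linear(dim, planes * plane_ch)

    def forward(self, img_tokens):          # (B, N, dim) patch embeddings
        feats = self.low_level(img_tokens)  # intermediate low-level features
        feats = self.high_freq(feats)       # refine with fine-scale detail
        return self.to_triplane(feats)      # per-token triplane features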

Expression Transfer. As shown in previous works, the first branch of Lp3D extracts global features, such as the geometry of the head, while the second branch extracts high-frequency details. Given a pre-trained Lp3D model, we can therefore modify the expression of a source image by directly editing its intermediate features in the low-level branch. To this end, we propose a new architecture for expression transfer, shown in the upper part of the above figure. On top of the pre-trained 3D lifting network, we add a trainable expression transfer module that alters the expression. This module first uses a vision transformer, initialized from DINO, to extract an expression vector from the driver image. The extracted expression vector then directly modifies the intermediate low-level features within the 3D lifting module through several cross- and self-attention layers, altering the expression of the source image.
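A minimal sketch of this transfer mechanism in PyTorch, assuming the expression tokens have already been extracted by the DINO-initialized ViT; the block count, dimensions, and normalization scheme are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class ExpressionTransferSketch(nn.Module):
    """Injects driver expression tokens into the low-level features of the
    source's 3D lifting branch via alternating cross- and self-attention."""
    def __init__(self, dim=256, n_heads=8, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.ModuleDict({
            "cross": nn.MultiheadAttention(dim, n_heads, batch_first=True),
            "self":  nn.MultiheadAttention(dim, n_heads, batch_first=True),
            "norm1": nn.LayerNorm(dim),
            "norm2": nn.LayerNorm(dim),
        }) for _ in range(n_blocks))

    def forward(self, source_feats, expr_tokens):
        # source_feats: (B, N, dim) intermediate low-level lifting features
        # expr_tokens:  (B, M, dim) expression tokens from the driver frame
        x = source_feats
        for blk in self.blocks:
            # Cross-attention: source features attend to the driver expression.
            attn, _ = blk["cross"](x, expr_tokens, expr_tokens)
            x = blk["norm1"](x + attn)
            # Self-attention: propagate the edit coherently across the face.
            attn, _ = blk["self"](x, x, x)
            x = blk["norm2"](x + attn)
        return x  # expression-modified features, fed back into 3D lifting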

Multi-stage Training. To stabilize the self-supervision of our transformer-based expression transfer model, we divide the training into multiple stages. The first stage uses several techniques to prevent identity leakage, so that the identity information of the driver does not leak into the reenacted image. These include driver 3D lifted head frontalization, augmentation, and a novel neutralizing loss that minimizes the difference between the neutralized source and the neutralized cross-reenacted image. In the second stage, we introduce a fine-scale training strategy that uses the model from the first stage to supervise the training of a higher-resolution model with the help of generated synthetic drivers. This allows us to disable all constraints from the previous stage that might compromise expression quality, further enhancing the expressiveness of the output. In the third stage, we apply global fine-tuning, unfreezing the 3D lifting module and adding a GAN loss to sharpen the high-frequency details of the reenacted images. Additionally, per-subject fine-tuning can be introduced as an optional step to improve the likeness and expressiveness for the target identity.
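To illustrate the stage-one neutralizing loss, the sketch below assumes a neutralize operator that maps a portrait to a neutral-expression, frontalized rendering of the same identity; the function name and the L1 penalty are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def neutralizing_loss(neutralize, source_img, reenacted_img):
    """Penalize driver identity leakage via expression neutralization.

    After neutralization, identity is the only signal that should remain,
    so any gap between the neutralized source and the neutralized
    cross-reenacted output indicates identity leaked from the driver.
    """
    with torch.no_grad():
        target = neutralize(source_img)   # neutral, frontalized source
    pred = neutralize(reenacted_img)      # neutral, frontalized reenactment
    return F.l1_loss(pred, target)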

Telepresence VR System. Our complete VR telepresence system is illustrated in Fig. 5. For VR-based facial performance capture, we simply use Meta’s Movement SDK [Meta 2024a] and Headset Tracking SDK [Meta 2024b] to provide input signals to a Unity game engine [Unity 2024] scene with a generic parametric blendshape model. The facial expressions consist of 63 blendshapes (7 of them for the tongue) and 2 gaze controls (angles) for each eye. This generic face model is animated and rendered using a traditional computer graphics (CG) pipeline to produce a live video stream, which serves as input to our neural head reenactment framework.
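For reference, a minimal sketch of the per-frame tracking signal that drives the generic blendshape model; the field names are hypothetical, but the dimensions follow the description above (63 blendshape coefficients, 7 of them for the tongue, plus 2 gaze angles per eye).

from dataclasses import dataclass, field
import numpy as np

@dataclass
class ExpressionFrame:
    """One frame of headset-captured facial animation (names hypothetical)."""
    blendshapes: np.ndarray = field(
        default_factory=lambda: np.zeros(63, np.float32))  # weights in [0, 1]
    gaze_left: np.ndarray = field(
        default_factory=lambda: np.zeros(2, np.float32))   # yaw, pitch angles
    gaze_right: np.ndarray = field(
        default_factory=lambda: np.zeros(2, np.float32))   # yaw, pitch angles

A stream of such frames animates the generic CG face in Unity, and the rendered video of that face is what our reenactment network consumes as the driver signal.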

Paper Video

Prior VOODOO Publications

[SIGGRAPH 2024 RTL] VOODOO VR: One-Shot Neural Avatars for Virtual Reality [Project page] [Video] [RTL]

[CVPR 2023] VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment [Project page] [Paper] [Code]

BibTeX

@article{tran2024voodoo,
  title={VOODOO XP: Expressive One-Shot Head Reenactment for VR Telepresence},
  author={Tran, Phong and Zakharov, Egor and Ho, Long-Nhat and Hu, Liwen and Karmanov, Adilbek and Agarwal, Aviral and Goldwhite, McLean and Venegas, Ariana Bermudez and Tran, Anh Tuan and Li, Hao},
  journal={ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2024)},
  month={12},
  year={2024}
}