VOODOO XP: Expressive One-Shot Head Reenactment for VR Telepresence

SIGGRAPH Asia 2024 (Journal Track)

ACM Transactions on Graphics

We introduce VOODOO XP, a real-time, 3D-aware one-shot head reenactment method that generates expressive facial animations from any driver video and a single 2D portrait. Our solution enables an end-to-end VR telepresence system.

Abstract

We introduce VOODOO XP: a 3D-aware one-shot head reenactment method that can generate highly expressive facial expressions from any input driver video and a single 2D portrait. Our solution is real-time, view-consistent, and can be instantly used without calibration or fine-tuning. We demonstrate our solution in a monocular video setting and an end-to-end VR telepresence system for two-way communication. Compared to 2D head reenactment methods, 3D-aware approaches aim to preserve the identity of the subject and ensure view-consistent facial geometry for novel camera poses, which makes them suitable for immersive applications. While various facial disentanglement techniques have been introduced, cutting-edge 3D-aware neural reenactment techniques still lack expressiveness and fail to reproduce complex and fine-scale facial expressions. We present a novel cross-reenactment architecture that directly transfers the driver's facial expressions to transformer blocks of the input source's 3D lifting module. We show that highly effective disentanglement is possible using an innovative multi-stage self-supervision approach, which combines a coarse-to-fine strategy with explicit face neutralization and 3D lifted frontalization during the initial training stage. We further integrate our novel head reenactment solution into an accessible high-fidelity VR telepresence system, where any person can instantly build a personalized neural head avatar from any photo and bring it to life using the headset. We demonstrate state-of-the-art performance in terms of expressiveness and likeness preservation on a large set of diverse subjects and capture conditions.

One-Shot 3D Head Reenactment

Telepresence VR System

Optional Few-shot Fine-tuning

Comparisons with SOTA

Methodology

Overview. We introduce a novel 3D-aware neural head reenactment architecture using a transformer-based expression transfer approach for generating highly expressive facial expressions in real-time. For effective volumetric disentanglement of complex expressions, we propose a multi-stage training approach based on face neutralization and 3D lifted frontalization, coarse-to-fine training, and global fine-tuning. Finally, we present the first end-to-end VR telepresence solution based on a one-shot 3D head reenactment algorithm.

Architecture
3D Lifting and Rendering. We use the state-of-the-art Lp3D model [Trevithick et al. 2023] for 3D lifting, which transforms a 2D input image into a 3D neural radiance field. The model architecture is shown in the lower part of the above figure. To enhance expression and identity generalization, we train this module on a combination of the NeRSemble dataset, which contains millions of diverse expressions, and a synthetic dataset generated by DiffPortrait3D, a diffusion-based 3D lifting model.
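To make the two-branch design concrete, here is a minimal PyTorch sketch of an Lp3D-style lifting module. All layer sizes, depths, and names are hypothetical simplifications of the architecture described by Trevithick et al. [2023], not the paper's exact configuration.

import torch
import torch.nn as nn

class Lp3DLiftingSketch(nn.Module):
    """Hypothetical two-branch lifting: portrait tokens -> triplane features."""
    def __init__(self, dim=256, planes=3, plane_ch=32):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True)
        # First branch: global, low-level features (head geometry, expression).
        self.low_level = nn.TransformerEncoder(layer(), num_layers=4)
        # Second branch: high-frequency details (skin texture, hair strands).
        self.high_freq = nn.TransformerEncoder(layer(), num_layers=4)
        self.to_triplane = nn.Linear(dim, planes * plane_ch)

    def forward(self, img_tokens):          # (B, N, dim) patch embeddings
        feats = self.low_level(img_tokens)  # intermediate low-level features
        feats = self.high_freq(feats)       # refine with fine-scale detail
        return self.to_triplane(feats)      # per-token triplane features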

Expression Transfer. As shown in previous works, the first branch of Lp3D extracts global features, such as the geometry of the head, while the second branch extracts high-frequency details. Given a pre-trained Lp3D model, we can therefore modify the expression of a source image by directly editing its intermediate features in the low-level branch. To this end, we propose a new architecture for expression transfer, shown in the upper part of the above figure. On top of the pre-trained 3D lifting network, we add a trainable expression transfer module that alters the expression. This module first uses a vision transformer, initialized from DINO, to extract an expression vector from the driver image. The extracted expression vector then directly modifies the intermediate low-level features within the 3D lifting module through several cross- and self-attention layers, altering the expression of the source image.
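A minimal sketch of this transfer mechanism in PyTorch, assuming the expression tokens have already been extracted by the DINO-initialized ViT; the block count, dimensions, and normalization scheme are illustrative assumptions rather than the paper's exact design.

import torch
import torch.nn as nn

class ExpressionTransferSketch(nn.Module):
    """Injects driver expression tokens into the low-level features of the
    source's 3D lifting branch via alternating cross- and self-attention."""
    def __init__(self, dim=256, n_heads=8, n_blocks=3):
        super().__init__()
        self.blocks = nn.ModuleList(nn.ModuleDict({
            "cross": nn.MultiheadAttention(dim, n_heads, batch_first=True),
            "self":  nn.MultiheadAttention(dim, n_heads, batch_first=True),
            "norm1": nn.LayerNorm(dim),
            "norm2": nn.LayerNorm(dim),
        }) for _ in range(n_blocks))

    def forward(self, source_feats, expr_tokens):
        # source_feats: (B, N, dim) intermediate low-level lifting features
        # expr_tokens:  (B, M, dim) expression tokens from the driver frame
        x = source_feats
        for blk in self.blocks:
            # Cross-attention: source features attend to the driver expression.
            attn, _ = blk["cross"](x, expr_tokens, expr_tokens)
            x = blk["norm1"](x + attn)
            # Self-attention: propagate the edit coherently across the face.
            attn, _ = blk["self"](x, x, x)
            x = blk["norm2"](x + attn)
        return x  # expression-modified features, fed back into 3D lifting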

Multi-stage Training. To stabilize the self-supervision of our transformer-based expression transfer model, we divide the training into multiple stages. The first stage uses several techniques to prevent identity leakage, so that the identity information of the driver does not leak into the reenacted image. These include driver 3D lifted head frontalization, augmentation, and a novel neutralizing loss that minimizes the difference between the neutralized source and the neutralized cross-reenacted image. In the second stage, we introduce a fine-scale training strategy that uses the model from the first stage to supervise the training of a higher-resolution model with the help of generated synthetic drivers. This allows us to disable all constraints from the previous stage that might compromise expression quality, further enhancing the expressiveness of the output. In the third stage, we apply global fine-tuning, unfreezing the 3D lifting module and adding a GAN loss to sharpen the high-frequency details of the reenacted images. Additionally, per-subject fine-tuning can be introduced as an optional step to improve the likeness and expressiveness for the target identity.
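To illustrate the stage-one neutralizing loss, the sketch below assumes a neutralize operator that maps a portrait to a neutral-expression, frontalized rendering of the same identity; the function name and the L1 penalty are assumptions for illustration, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def neutralizing_loss(neutralize, source_img, reenacted_img):
    """Penalize driver identity leakage via expression neutralization.

    After neutralization, identity is the only signal that should remain,
    so any gap between the neutralized source and the neutralized
    cross-reenacted output indicates identity leaked from the driver.
    """
    with torch.no_grad():
        target = neutralize(source_img)   # neutral, frontalized source
    pred = neutralize(reenacted_img)      # neutral, frontalized reenactment
    return F.l1_loss(pred, target)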

Telepresence VR System. Our complete VR telepresence system is illustrated in Fig. 5. For VR-based facial performance capture, we simply use Meta’s Movement SDK [Meta 2024a] and Headset Tracking SDK [Meta 2024b] to provide input signals to a Unity game engine [Unity 2024] scene with a generic parametric blendshape model. The facial expressions consist of 63 blendshapes (7 of them for the tongue) and 2 gaze controls (angles) for each eye. This generic face model is animated and rendered using a traditional computer graphics (CG) pipeline to produce a live video stream, which serves as input to our neural head reenactment framework.
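For reference, a minimal sketch of the per-frame tracking signal that drives the generic blendshape model; the field names are hypothetical, but the dimensions follow the description above (63 blendshape coefficients, 7 of them for the tongue, plus 2 gaze angles per eye).

from dataclasses import dataclass, field
import numpy as np

@dataclass
class ExpressionFrame:
    """One frame of headset-captured facial animation (names hypothetical)."""
    blendshapes: np.ndarray = field(
        default_factory=lambda: np.zeros(63, np.float32))  # weights in [0, 1]
    gaze_left: np.ndarray = field(
        default_factory=lambda: np.zeros(2, np.float32))   # yaw, pitch angles
    gaze_right: np.ndarray = field(
        default_factory=lambda: np.zeros(2, np.float32))   # yaw, pitch angles

A stream of such frames animates the generic CG face in Unity, and the rendered video of that face is what our reenactment network consumes as the driver signal.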

Paper Video

Prior VOODOO Publications

[SIGGRAPH 2024 RTL] VOODOO VR: One-Shot Neural Avatars for Virtual Reality [Project page] [Video] [RTL]

[CVPR 2023] VOODOO 3D: Volumetric Portrait Disentanglement for One-Shot 3D Head Reenactment [Project page] [Paper] [Code]

BibTeX

@article{tran2024voodoo,
  title={VOODOO XP: Expressive One-Shot Head Reenactment for VR Telepresence},
  author={Tran, Phong and Zakharov, Egor and Ho, Long-Nhat and Hu, Liwen and Karmanov, Adilbek and Agarwal, Aviral and Goldwhite, McLean and Venegas, Ariana Bermudez and Tran, Anh Tuan and Li, Hao},
  journal={ACM Transactions on Graphics (Proceedings of SIGGRAPH Asia 2024)},
  month={12},
  year={2024}
}