Neural Human Performer: Learning Generalizable Radiance Fields for Human Performance Rendering

NeurIPS 2021 (Spotlight)

University of North Carolina at Chapel Hill     Korea Advanced Institute of Science and Technology     Adobe Research

Neural Human Performer can reconstruct a performance of an unseen subject with unseen poses.

In this paper, we aim to synthesize a free-viewpoint video of an arbitrary human performance using sparse multi-view cameras. Recently, several works have addressed this problem by learning person-specific radiance fields that capture the appearance of a particular human. In parallel, other works have proposed to use pixel-aligned features to generalize radiance fields to arbitrary new scenes and objects at test time. Adapting such approaches to humans, however, is highly challenging due to the heavy occlusions and dynamic articulations of body parts. To tackle this, we propose Neural Human Performer, a novel approach that learns generalizable radiance fields based on a parametric human body model for robust performance capture. Specifically, we first introduce a temporal transformer that aggregates trackable visual features based on the skeletal body motion over video frames. Moreover, a multi-view transformer is proposed to perform cross-attention between the temporally-fused features and the pixel-aligned features at each time step to integrate observations on the fly from multiple views. Experiments on the ZJU-MoCap and AIST datasets show that our method significantly outperforms recent generalizable NeRF methods on unseen identities and poses.

Overview Video

We propose Neural Human Performer, a novel approach that learns generalizable radiance fields based on a parametric body model for robust performance capture. In addition to exploiting the parametric body model as a geometric prior, the core of our method is a combination of temporal and multi-view transformers that effectively aggregate spatio-temporal observations to robustly compute the density and color of a query point. First, the temporal transformer aggregates trackable visual features based on the input skeletal body motion over the video frames. The subsequent multi-view transformer then performs cross-attention between the temporally-augmented skeletal features and the pixel-aligned features from each time step. The two modules collectively enable adaptive aggregation of multi-time and multi-view information, yielding significant improvements in synthesis quality across different generalization settings (a code sketch of this aggregation follows the pipeline figure below).

Pipeline overview.
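To make the two-stage aggregation concrete, below is a minimal PyTorch-style sketch of how a per-query-point feature could be fused over time and across views before being decoded into density and color. The module names (TemporalTransformer, MultiViewTransformer, RadianceHead), tensor shapes, and the mean-pooling/attention details are illustrative assumptions for exposition, not the authors' exact implementation.

# Minimal sketch (assumed shapes and module names, not the released code).
import torch
import torch.nn as nn

class TemporalTransformer(nn.Module):
    """Fuses per-frame skeletal features (anchored to the body model) over time."""
    def __init__(self, dim, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, skel_feats):           # (B, T, dim): one feature per frame
        fused = self.encoder(skel_feats)     # self-attention across the T frames
        return fused.mean(dim=1)             # (B, dim): temporally-fused feature

class MultiViewTransformer(nn.Module):
    """Cross-attends the temporally-fused feature to per-view pixel-aligned features."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, temporal_feat, pixel_feats):       # (B, dim), (B, V, dim)
        q = temporal_feat.unsqueeze(1)                    # query: fused skeletal feature
        out, _ = self.attn(q, pixel_feats, pixel_feats)   # keys/values: per-view features
        return out.squeeze(1)                             # (B, dim): multi-view aggregate

class RadianceHead(nn.Module):
    """Maps the aggregated feature (plus view direction) to density and color."""
    def __init__(self, dim):
        super().__init__()
        self.sigma = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.rgb = nn.Sequential(nn.Linear(dim + 3, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, feat, view_dir):                    # (B, dim), (B, 3)
        sigma = self.sigma(feat)                          # (B, 1) density
        rgb = torch.sigmoid(self.rgb(torch.cat([feat, view_dir], dim=-1)))  # (B, 3) color
        return sigma, rgb

if __name__ == "__main__":
    B, T, V, dim = 1024, 3, 4, 64                         # query points, frames, views, feature dim
    skel_feats = torch.randn(B, T, dim)                   # trackable skeletal features per frame
    pixel_feats = torch.randn(B, V, dim)                  # pixel-aligned features per camera view
    view_dir = torch.randn(B, 3)

    temporal = TemporalTransformer(dim)(skel_feats)
    fused = MultiViewTransformer(dim)(temporal, pixel_feats)
    sigma, rgb = RadianceHead(dim)(fused, view_dir)
    print(sigma.shape, rgb.shape)                         # (1024, 1) and (1024, 3)

In this sketch the density and color of each query point are then composited along camera rays with standard NeRF-style volume rendering; that step is omitted here for brevity.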

Free-viewpoint rendering - Seen models' unseen poses

We compare our method with Neural Body [1] on seen models' unseen poses. Note that Neural Body was optimized per subject (one network for one person), while our model was trained on all training subjects (one network for multiple people).

Free-viewpoint rendering - Unseen models' unseen poses

We compare our method with pixelNeRF [2] on unseen models' unseen poses.

3D Reconstruction - Unseen models' unseen poses

We compare the 3D reconstructions produced by our method with those of pixelNeRF [2] on unseen models' unseen poses.

Acknowledgements

We thank Sida Peng of Zhejiang University, Hangzhou, China, for many helpful discussions on a variety of implementation details of Neural Body. We thank Ruilong Li and Alex Yu of UC Berkeley for many discussions on the AIST++ dataset and pixelNeRF details. We also thank Alex Yu for the template of this website.

Please send any questions or comments to YoungJoong Kwon.