University of North Carolina at Chapel Hill; Google Research, Brain Team; Adobe Research
We present a method that synthesizes novel views and novel poses of arbitrary human performers from sparse multi-view images. A key ingredient of our method is a hybrid appearance blending module that combines the advantages of an implicit body NeRF representation and image-based rendering. Existing generalizable human NeRF methods conditioned on a body model are robust to the geometric variation of arbitrary human performers, yet they often produce blurry results when generalizing to unseen identities. Image-based rendering, in contrast, yields high-quality results when sufficient observations are available but suffers from artifacts in sparse-view settings. We propose Neural Image-based Avatars (NIA), which exploits the best of both: it maintains robustness under new articulations and self-occlusions while directly leveraging the available (sparse) source-view colors to preserve the appearance details of new subject identities. Our hybrid design outperforms recent methods on both in-domain identity generalization and challenging cross-dataset generalization. In terms of pose generalization, our method even outperforms per-subject optimized animatable NeRF methods.
Given sparse-view images of an unseen person in the reference space, NIA instantly creates a human avatar for novel view synthesis and pose animation. NIA is a hybrid approach that combines an SMPL-based implicit body representation with image-based rendering. Our appearance blender learns to adaptively blend the predictions from the two components. Note that if the reference space and the target pose space are identical (i.e., no pose difference, x = x', d = d'), the task is novel view synthesis; otherwise, it is pose animation, where we deform the NIA representation to repose the learned avatar.
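To make the blending idea concrete, the sketch below shows one way such an adaptive appearance blender could be written in PyTorch. It is not the released implementation: the module name, feature dimensions, and inputs are illustrative assumptions. Each branch (body NeRF and image-based rendering) supplies a proposed color and a per-point feature, and a small MLP predicts a per-point weight for a convex combination of the two colors.

```python
# Minimal sketch (not the authors' code) of adaptive appearance blending:
# a body-NeRF branch and an image-based rendering (IBR) branch each propose
# a color for a 3D query point, and a small MLP predicts a blend weight.
# All names, shapes, and feature sizes below are illustrative assumptions.
import torch
import torch.nn as nn


class AppearanceBlender(nn.Module):
    def __init__(self, feat_dim=32):
        super().__init__()
        # Predicts a scalar blend weight from the concatenated per-point
        # features of the two branches (dimensions are placeholders).
        self.weight_mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, nerf_rgb, nerf_feat, ibr_rgb, ibr_feat):
        # nerf_rgb / ibr_rgb:   (N, 3) colors proposed by each branch.
        # nerf_feat / ibr_feat: (N, feat_dim) per-point features.
        w = self.weight_mlp(torch.cat([nerf_feat, ibr_feat], dim=-1))  # (N, 1)
        # Convex combination: rely on IBR colors where source views cover the
        # point well, and fall back to the body-NeRF prediction elsewhere.
        return w * ibr_rgb + (1.0 - w) * nerf_rgb


# Toy usage with random stand-ins for the two branches' outputs.
if __name__ == "__main__":
    n_pts, feat_dim = 1024, 32
    blender = AppearanceBlender(feat_dim)
    rgb = blender(
        torch.rand(n_pts, 3), torch.randn(n_pts, feat_dim),
        torch.rand(n_pts, 3), torch.randn(n_pts, feat_dim),
    )
    print(rgb.shape)  # torch.Size([1024, 3])
```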
We compare our method with IBRNet and NHP on the unseen subjects of the ZJU-MoCap dataset.
Our method can create an avatar and animate it given only still multi-view images of an unseen subject, and it even outperforms the per-subject optimized Animatable NeRF.
For out-of-domain, cross-dataset generalization, we train a model on the ZJU-MoCap dataset and test it on the MonoCap dataset without any fine-tuning.
Given only three snapshots of a new person, our NIA performs plausible pose animation.
We thank Sida Peng of Zhejiang University for many helpful discussions on the implementation details of Animatable-NeRF. We thank Shih-Yang Su of the University of British Columbia for helpful discussions on the A-NeRF details. We thank Prof. Helge Rhodin and his group for insightful discussions on human performance capture. This work was partially supported by National Science Foundation Award 2107454. We also thank Alex Yu for the template of this website.