AvatarReX: Real-time Expressive Full-body Avatars (2305.04789v1)

Published 8 May 2023 in cs.CV and cs.GR

Abstract: We present AvatarReX, a new method for learning NeRF-based full-body avatars from video data. The learnt avatar not only provides expressive control of the body, hands and the face together, but also supports real-time animation and rendering. To this end, we propose a compositional avatar representation, where the body, hands and the face are separately modeled in a way that the structural prior from parametric mesh templates is properly utilized without compromising representation flexibility. Furthermore, we disentangle the geometry and appearance for each part. With these technical designs, we propose a dedicated deferred rendering pipeline, which can be executed in real-time framerate to synthesize high-quality free-view images. The disentanglement of geometry and appearance also allows us to design a two-pass training strategy that combines volume rendering and surface rendering for network training. In this way, patch-level supervision can be applied to force the network to learn sharp appearance details on the basis of geometry estimation. Overall, our method enables automatic construction of expressive full-body avatars with real-time rendering capability, and can generate photo-realistic images with dynamic details for novel body motions and facial expressions.

Citations (60)

Summary

  • The paper introduces a compositional and disentangled representation that enables real-time rendering of expressive full-body avatars from video data.
  • It employs a novel two-pass training strategy combining volume and deferred surface rendering to refine geometry and appearance for photorealistic detail.
  • The system achieves 25 fps on consumer-grade hardware, offering robust control over the body, hands, and face for interactive applications.

An Overview of AvatarReX: Real-time Expressive Full-body Avatars

The paper "AvatarReX: Real-time Expressive Full-body Avatars" presents a technique for learning real-time animatable avatars from video data using NeRF-based models. The authors, Zerong Zheng et al., focus on avatars that are both expressive and controllable, pairing advanced animation capabilities with real-time rendering.

The introduction of AvatarReX tackles two longstanding challenges in avatar modeling: achieving full expressivity across the body, hands, and face, and enabling real-time rendering, a prerequisite for interactive applications. Conventional methods rely heavily on complex capture setups and the manual interventions inherent to traditional modeling pipelines, with stages of scanning, meshing, and rigging, and they remain computationally expensive.

AvatarReX distinguishes itself through its compositional representation. The method separates the body, hands, and face into distinct implicit models, leveraging parametric mesh templates such as SMPL-X for the body and MANO for the hands. This separation allows robust avatar control while ensuring that the structural priors provided by these templates are integrated without limiting the flexibility of the representation.
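
To make the composition concrete, the sketch below shows one minimal way such a part-wise representation could be organized: three independent implicit fields, each conditioned on its own template parameters, merged at query time. The network sizes, conditioning dimensions, and max-density merge rule are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PartField(nn.Module):
    """A small MLP standing in for one part-specific implicit field.

    The real system conditions each part on parameters from its
    template (SMPL-X body pose, MANO hand pose, expression coefficients);
    here we simply concatenate a generic conditioning vector.
    """
    def __init__(self, cond_dim: int, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3),  # density + RGB
        )

    def forward(self, xyz: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        cond = cond.expand(xyz.shape[0], -1)
        out = self.mlp(torch.cat([xyz, cond], dim=-1))
        density = torch.relu(out[:, :1])   # non-negative density
        rgb = torch.sigmoid(out[:, 1:])    # colors in [0, 1]
        return torch.cat([density, rgb], dim=-1)

class CompositionalAvatar(nn.Module):
    """Queries body / hand / face fields and merges their outputs.

    The merge rule here (highest density wins) is a simplification;
    the paper composites parts using spatial structure derived from
    the mesh templates.
    """
    def __init__(self):
        super().__init__()
        self.parts = nn.ModuleDict({
            "body": PartField(cond_dim=63),   # e.g. SMPL-X body pose
            "hands": PartField(cond_dim=90),  # e.g. MANO pose, both hands
            "face": PartField(cond_dim=50),   # e.g. expression coefficients
        })

    def forward(self, xyz, conds):
        outs = torch.stack([self.parts[k](xyz, conds[k]) for k in self.parts])
        # Keep, per sample point, the part with the highest density.
        idx = outs[:, :, 0].argmax(dim=0)
        return outs[idx, torch.arange(xyz.shape[0])]

avatar = CompositionalAvatar()
pts = torch.rand(4, 3)
conds = {"body": torch.zeros(1, 63), "hands": torch.zeros(1, 90),
         "face": torch.zeros(1, 50)}
print(avatar(pts, conds).shape)  # torch.Size([4, 4]) -> density + RGB
```

Keeping the parts in separate modules lets each field stay small and specialize in its own region while the template parameters provide structured control, which is the intuition behind the compositional design.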

The innovation does not stop at the separation: the authors also disentangle geometry and appearance within each part, giving each a more focused learning signal. This disentanglement underpins a dedicated deferred rendering pipeline that markedly accelerates image synthesis, producing high-quality visuals at interactive framerates, a cornerstone capability for AR/VR applications.
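
A minimal sketch of what such a split can look like, with deferred shading on top, follows. Here the geometry branch returns a signed distance and a feature vector, and the appearance branch shades only per-pixel surface points; the layer sizes and the assumption that surface intersections are already available (e.g. from sphere tracing) are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class GeometryNet(nn.Module):
    """Geometry branch: maps a 3D point to a signed distance plus a
    feature vector. Kept separate from appearance so each branch can
    be supervised and queried independently."""
    def __init__(self, feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + feat_dim),
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        return out[..., :1], out[..., 1:]  # sdf, geometry feature

class AppearanceNet(nn.Module):
    """Appearance branch: shades a surface point from its geometry
    feature and the viewing direction."""
    def __init__(self, feat_dim: int = 16, hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),
        )

    def forward(self, feat, view_dir):
        return self.mlp(torch.cat([feat, view_dir], dim=-1))

def deferred_render(geom, app, surface_xyz, view_dirs):
    """Deferred shading: geometry is evaluated once per pixel at the
    surface intersection (found beforehand, e.g. by sphere tracing the
    SDF), then appearance shades that single point. Avoiding the
    appearance network on dozens of volume samples per ray is what
    makes interactive framerates plausible."""
    _, feats = geom(surface_xyz)   # per-pixel geometry features
    return app(feats, view_dirs)   # per-pixel RGB

geom, app = GeometryNet(), AppearanceNet()
h, w = 4, 4
surface = torch.rand(h * w, 3)  # stand-in surface intersections
views = nn.functional.normalize(torch.randn(h * w, 3), dim=-1)
rgb = deferred_render(geom, app, surface, views).reshape(h, w, 3)
print(rgb.shape)  # torch.Size([4, 4, 3])
```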

A significant contribution is the two-pass training strategy combining volume rendering with deferred surface rendering. The first pass uses volume rendering, which imposes few structural assumptions, to sculpt the avatar's form. The second pass applies patch-level perceptual losses through surface rendering to bring out fine details. This two-step approach bridges geometry learning and appearance refinement, yielding photo-realistic textures and nuanced expressions, outcomes crucial for overcoming the 'uncanny valley' effect.
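
The skeleton below illustrates the two supervision regimes under toy assumptions: pass one composites volume samples and applies a per-ray photometric loss, while pass two renders whole patches and applies a patch-level loss. The L1 patch loss is a dependency-free stand-in for a perceptual metric (such as LPIPS), and the random tensors stand in for network outputs and ground-truth images.

```python
import torch
import torch.nn.functional as F

def volume_render(densities, colors, deltas):
    """Standard NeRF-style compositing along each ray.
    densities: (R, S), colors: (R, S, 3), deltas: (R, S)."""
    alpha = 1.0 - torch.exp(-densities * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], -1),
        dim=-1)[:, :-1]
    weights = alpha * trans
    return (weights.unsqueeze(-1) * colors).sum(dim=1)

def patch_loss(pred_patch, gt_patch):
    """Stand-in for a patch-level perceptual loss. A real
    implementation would compare deep features (e.g. LPIPS);
    L1 keeps this sketch dependency-free."""
    return F.l1_loss(pred_patch, gt_patch)

# --- Pass 1: volume rendering, per-ray photometric supervision ------
R, S = 128, 32                    # rays, samples per ray (toy sizes)
densities = torch.rand(R, S, requires_grad=True)
colors = torch.rand(R, S, 3, requires_grad=True)
deltas = torch.full((R, S), 0.01)
gt_rays = torch.rand(R, 3)
loss_volume = F.mse_loss(volume_render(densities, colors, deltas), gt_rays)
loss_volume.backward()            # shapes the geometry estimate

# --- Pass 2: deferred surface rendering, patch-level supervision ----
# With geometry established, appearance is rendered as whole image
# patches so a patch-level loss can sharpen high-frequency detail.
pred_patch = torch.rand(1, 3, 32, 32, requires_grad=True)
gt_patch = torch.rand(1, 3, 32, 32)
loss_surface = patch_loss(pred_patch, gt_patch)
loss_surface.backward()
print(loss_volume.item(), loss_surface.item())
```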

The experiments validate these design choices, demonstrating AvatarReX's capability to produce near-photorealistic avatars at 25 frames per second at 1024x1024 resolution on consumer-grade hardware. The avatars exhibit dynamic realism in clothing wrinkles, subtle hand gestures, and facial expressions, including for novel, arbitrarily specified motions. These results underline the system's suitability for interactive environments and hint at future applications in gaming, virtual meetings, and digital-twin scenarios.

In terms of future implications, AvatarReX sets the stage for ongoing developments in avatar modeling by establishing a framework that balances computational demands with animation fidelity. Requiring fewer cameras and less compute broadens the technology's accessibility and invites incremental improvements in capture systems. Moreover, disentangling geometry and appearance offers a roadmap for handling more complex scenarios, such as multi-layered clothing or compositing with environmental lighting.

The limitations, such as visual artifacts at clothing boundaries or during transitions caused by simplified modeling of body movement, suggest avenues for further exploration. Better handling of dynamic illumination or the integration of more sophisticated physical simulation could refine the realism further.

In conclusion, AvatarReX's contribution lies in its compositional and disentangled representations, which yield a robust system for real-time full-body avatar rendering. This work is likely to influence the evolution of interactive media, providing a foundation and stimulus for further work on animated avatar technologies. The real-time synthesis of dynamic, expressive avatars speaks to the growing convergence of artificial intelligence, machine perception, and human-centered computing.
