
Editable Free-viewpoint Video Using a Layered Neural Representation (2104.14786v1)

Published 30 Apr 2021 in cs.CV and cs.GR

Abstract: Generating free-viewpoint videos is critical for immersive VR/AR experience but recent neural advances still lack the editing ability to manipulate the visual perception for large dynamic scenes. To fill this gap, in this paper we propose the first approach for editable photo-realistic free-viewpoint video generation for large-scale dynamic scenes using only 16 sparse cameras. The core of our approach is a new layered neural representation, where each dynamic entity including the environment itself is formulated into a space-time coherent neural layered radiance representation called ST-NeRF. Such layered representation supports full perception and realistic manipulation of the dynamic scene whilst still supporting a free viewing experience in a wide range. In our ST-NeRF, the dynamic entity/layer is represented as continuous functions, which achieves the disentanglement of location, deformation as well as the appearance of the dynamic entity in a continuous and self-supervised manner. We propose a scene parsing 4D label map tracking to disentangle the spatial information explicitly, and a continuous deform module to disentangle the temporal motion implicitly. An object-aware volume rendering scheme is further introduced for the re-assembling of all the neural layers. We adopt a novel layered loss and motion-aware ray sampling strategy to enable efficient training for a large dynamic scene with multiple performers. Our framework further enables a variety of editing functions, i.e., manipulating the scale and location, duplicating or retiming individual neural layers to create numerous visual effects while preserving high realism. Extensive experiments demonstrate the effectiveness of our approach to achieve high-quality, photo-realistic, and editable free-viewpoint video generation for dynamic scenes.

Authors (9)
  1. Jiakai Zhang (17 papers)
  2. Xinhang Liu (17 papers)
  3. Xinyi Ye (14 papers)
  4. Fuqiang Zhao (15 papers)
  5. Yanshun Zhang (3 papers)
  6. Minye Wu (31 papers)
  7. Yingliang Zhang (17 papers)
  8. Lan Xu (102 papers)
  9. Jingyi Yu (172 papers)
Citations (164)

Summary

  • The paper introduces ST-NeRF, a layered neural representation that disentangles spatial and temporal information to enable editable free-viewpoint video synthesis for dynamic scenes.
  • It employs a space-time deform module and a neural radiance module to turn input from a sparse rig of 16 cameras into photorealistic renderings.
  • The approach supports object-aware volume rendering and direct neural scene editing, offering a cost-effective solution for dynamic content production.

Editable Free-Viewpoint Video Using a Layered Neural Representation

The paper presents a novel approach to generating editable free-viewpoint video for large-scale view-dependent dynamic scenes using a sparse setup of 16 cameras. The primary innovation in this work is the introduction of a layered neural representation, termed ST-NeRF (spatio-temporal neural radiance field), which models each dynamic entity—including the environment itself—as a separate neural layer capable of supporting spatial and temporal edits while preserving photorealistic rendering.

Layered Neural Representation

The core of the authors' approach is a neural radiance field that disentangles spatial and temporal information for dynamic entities in a scene. Each entity is described using a continuous function that accounts for its location, deformation, and appearance over time. This is achieved through two modules: a space-time deform module and a neural radiance module. The deform module encodes temporal motion, allowing points sampled from various times and viewpoints to be deformed into a canonical space, while the radiance module records geometry and color, facilitating view-dependent effects across complex dynamic scenes.
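As a rough illustration of this two-module design, the following PyTorch sketch pairs a deform MLP, which warps a point sampled at time t into a canonical space, with a radiance MLP that returns density and view-dependent color there. The network widths, depths, and positional-encoding frequencies are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of one ST-NeRF layer (PyTorch), assuming the two-module design
# described above. Layer widths, depths, and encoding frequencies are illustrative
# guesses, not the paper's exact hyperparameters.
import math
import torch
import torch.nn as nn


def positional_encoding(x, num_freqs):
    """Map each input coordinate to [sin(2^k * pi * x), cos(2^k * pi * x)] features."""
    freqs = (2.0 ** torch.arange(num_freqs, device=x.device)) * math.pi
    angles = x[..., None] * freqs          # (..., dims, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
    return enc.flatten(start_dim=-2)       # (..., dims * 2 * num_freqs)


class STNeRFLayer(nn.Module):
    def __init__(self, pos_freqs=10, time_freqs=4, dir_freqs=4, width=256):
        super().__init__()
        pos_dim = 3 * 2 * pos_freqs
        time_dim = 1 * 2 * time_freqs
        dir_dim = 3 * 2 * dir_freqs
        self.pos_freqs, self.time_freqs, self.dir_freqs = pos_freqs, time_freqs, dir_freqs

        # Space-time deform module: (x, t) -> offset into the canonical space.
        self.deform = nn.Sequential(
            nn.Linear(pos_dim + time_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 3),
        )
        # Neural radiance module: canonical point + view direction -> (density, color).
        self.radiance = nn.Sequential(
            nn.Linear(pos_dim + dir_dim, width), nn.ReLU(),
            nn.Linear(width, width), nn.ReLU(),
            nn.Linear(width, 4),
        )

    def forward(self, x, t, d):
        """x: (N, 3) points, t: (N, 1) times, d: (N, 3) view directions."""
        xt = torch.cat([positional_encoding(x, self.pos_freqs),
                        positional_encoding(t, self.time_freqs)], dim=-1)
        x_canonical = x + self.deform(xt)   # warp each sample into canonical space
        feat = torch.cat([positional_encoding(x_canonical, self.pos_freqs),
                          positional_encoding(d, self.dir_freqs)], dim=-1)
        out = self.radiance(feat)
        sigma = torch.relu(out[..., :1])    # non-negative volume density
        rgb = torch.sigmoid(out[..., 1:])   # color in [0, 1]
        return sigma, rgb
```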

Scene Composition and Object-Aware Rendering

A unique aspect of this approach is the object-aware volume rendering technique, which enables the independent manipulation and seamless composition of different neural layers. This rendering strategy samples each neural layer within its own spatial bounds and composites the sorted samples along every ray, allowing for accurate reconstruction and rendering of both occluded and visible layers with realistic blending. This makes it possible to generate free-viewpoint videos with editable components—a capability unavailable in traditional image-based rendering approaches.
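A minimal sketch of such object-aware compositing follows, under the assumption that each layer exposes the (x, t, d) → (density, color) interface sketched above: samples are drawn within every layer's depth bounds, merged, sorted by depth along the ray, and accumulated with the standard volume-rendering quadrature. The render_ray helper and its arguments are hypothetical names for illustration, not the paper's API.

```python
# Hedged sketch of object-aware volume rendering: samples drawn inside each
# layer's bounds are merged and sorted by depth along the ray, then
# alpha-composited with the usual NeRF quadrature. Variable names and the
# per-layer sampling interface are illustrative, not the paper's exact API.
import torch


def render_ray(ray_o, ray_d, t_now, layers, samples_per_layer=64):
    """ray_o, ray_d: (3,) origin and direction; layers: list of (layer, near, far)."""
    depths, sigmas, rgbs = [], [], []
    for layer, near, far in layers:
        # Uniform sample depths inside this layer's bounds for the current frame.
        z = torch.linspace(near, far, samples_per_layer)
        pts = ray_o + z[:, None] * ray_d                     # (S, 3)
        dirs = ray_d.expand_as(pts)
        times = torch.full((samples_per_layer, 1), t_now)
        sigma, rgb = layer(pts, times, dirs)
        depths.append(z)
        sigmas.append(sigma.squeeze(-1))
        rgbs.append(rgb)

    # Merge samples from every layer and sort them by depth so occlusions
    # between performers and background resolve correctly.
    z_all = torch.cat(depths)
    order = torch.argsort(z_all)
    z_all = z_all[order]
    sigma_all = torch.cat(sigmas)[order]
    rgb_all = torch.cat(rgbs)[order]

    # Standard volume-rendering quadrature over the merged sample set.
    deltas = torch.cat([z_all[1:] - z_all[:-1], torch.tensor([1e10])])
    alpha = 1.0 - torch.exp(-sigma_all * deltas)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * transmittance
    return (weights[:, None] * rgb_all).sum(dim=0)           # composited RGB
```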

Neural Scene Editing

The rich feature set of neural editing enabled by ST-NeRF includes spatial transformation, temporal retiming, object insertion and removal, and transparency adjustment. This allows not only for visual effects such as movement and duplication but also for depth-aware rendering that keeps occlusions consistent and realistic. These edits are achieved through direct manipulation of a layer's spatial position and timing during inference, without the need for additional processing or data outside of the initial capture set.
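The sketch below illustrates how such inference-time edits can be expressed as a wrapper around a trained layer, again assuming the (x, t, d) interface used above: affine edits become an inverse warp of the query points, retiming becomes a time offset, and transparency scales the layer's density. The EditedLayer class and the performer_layer in the usage comment are hypothetical constructions, not the paper's implementation.

```python
# Hedged sketch of inference-time layer editing: a wrapper applies an affine
# transform and a time offset to one layer by mapping query points back into the
# layer's original frame before evaluation, and scales density for transparency.
import torch
import torch.nn as nn


class EditedLayer(nn.Module):
    def __init__(self, layer, scale=1.0, rotation=None, translation=None,
                 time_offset=0.0, opacity=1.0):
        super().__init__()
        self.layer = layer
        self.scale = scale
        self.rotation = rotation if rotation is not None else torch.eye(3)
        self.translation = translation if translation is not None else torch.zeros(3)
        self.time_offset = time_offset   # retiming: shift the layer's playback
        self.opacity = opacity           # 0 removes the layer, <1 makes it translucent

    def forward(self, x, t, d):
        # Undo the edit on the query: world-space samples are mapped back into
        # the layer's original coordinates, so moving/scaling the object is just
        # an inverse warp of the inputs to the unchanged, trained layer.
        x_local = (self.rotation.T @ (x - self.translation).T).T / self.scale
        d_local = (self.rotation.T @ d.T).T
        sigma, rgb = self.layer(x_local, t - self.time_offset, d_local)
        return sigma * self.opacity, rgb


# Example: duplicate a performer, shift the copy 1 m along x, delay it by 10 frames.
# copy = EditedLayer(performer_layer, translation=torch.tensor([1.0, 0.0, 0.0]),
#                    time_offset=10.0)
```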

Results and Implications

The results demonstrate the effectiveness of this methodology in producing high-quality, photo-realistic, editable free-viewpoint videos with applications ranging from VR/AR experiences to entertainment and gaming. The ability to produce editable scenes from a sparse camera setup points to a shift in production paradigms toward less resource-intensive rigs without compromising quality. Furthermore, these techniques change how dynamic scenes can be edited in post-production, opening new opportunities in visual media and content creation.

Future Directions

Future research may address some current limitations, such as extending applicability across a wider range of dynamic entities and improving appearance modeling for scenarios affected by extreme lighting conditions. Additionally, further reduction in required camera arrays might be possible through advancements in neural rendering algorithms that better interpolate missing data or through the integration of pre-scanned environmental contexts as initial proxies. Exploring non-rigid manipulation of entities, as well as leveraging advanced human motion models such as SMPL, could also enhance the technique’s versatility.

In conclusion, the effectiveness and potential practical applications of layered neural representations in free-viewpoint video production demonstrate substantial progress in neural rendering technology, promising transformative impacts on dynamic scene modeling and interactive media creation.