Interpretability of neural radiance fields encoded in network weights

Develop principled methods to analyze and interpret neural radiance fields: continuous 5D functions, represented by multilayer perceptrons, that map spatial position and viewing direction to volume density and view-dependent emitted radiance. The goal is to reason about the expected quality of rendered views and to identify failure modes, given that scene content is encoded in network weights rather than in explicit sampled representations such as voxel grids or meshes.

Background

NeRF represents scenes as continuous 5D functions encoded in the parameters of a fully-connected neural network (an MLP) that outputs volume density and view-dependent color for any spatial coordinate and viewing direction. This approach departs from traditional explicit sampled representations like voxel grids or triangle meshes, which offer direct geometric and appearance primitives that can be inspected and reasoned about.
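The weight-encoded representation described above can be illustrated with a toy sketch. The snippet below is a hypothetical, minimal stand-in for NeRF's MLP (the real model is an 8-layer network whose weights are optimized against posed images; here the weights are random and the layer sizes are invented for illustration). It shows the interface at issue: the scene is queried only through a function from position and viewing direction to density and color, with no inspectable geometric primitives.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    # Map each coordinate through sin/cos at increasing frequencies,
    # as NeRF does to help the MLP represent high-frequency detail.
    freqs = 2.0 ** np.arange(num_freqs)  # 1, 2, 4, 8
    angles = x[..., None] * freqs        # (..., dims, num_freqs)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)

class TinyRadianceField:
    """Toy weight-encoded scene: (position, view direction) -> (density, rgb).

    Hypothetical miniature of NeRF's MLP; weights are random here,
    whereas a real model fits them to a specific scene.
    """
    def __init__(self, hidden=32, num_freqs=4, seed=0):
        rng = np.random.default_rng(seed)
        in_pos = 3 * 2 * num_freqs  # encoded 3D position size
        in_dir = 3 * 2 * num_freqs  # encoded 3D direction size
        self.num_freqs = num_freqs
        self.W1 = rng.normal(0, 0.1, (in_pos, hidden))
        self.W_sigma = rng.normal(0, 0.1, (hidden, 1))
        self.W_rgb = rng.normal(0, 0.1, (hidden + in_dir, 3))

    def __call__(self, pos, view_dir):
        h = np.tanh(positional_encoding(pos, self.num_freqs) @ self.W1)
        # Softplus keeps volume density non-negative.
        sigma = np.log1p(np.exp(h @ self.W_sigma))
        # Color also depends on viewing direction, giving
        # view-dependent effects such as specular highlights.
        feat = np.concatenate(
            [h, positional_encoding(view_dir, self.num_freqs)], axis=-1)
        rgb = 1.0 / (1.0 + np.exp(-(feat @ self.W_rgb)))  # sigmoid: rgb in [0, 1]
        return sigma, rgb

field = TinyRadianceField()
pos = np.array([[0.1, -0.3, 0.5]])
view = np.array([[0.0, 0.0, 1.0]])
sigma, rgb = field(pos, view)
print(sigma.shape, rgb.shape)  # (1, 1) (1, 3)
```

Note that "the scene" here is nothing but `W1`, `W_sigma`, and `W_rgb`: unlike a voxel grid or mesh, there is no local structure in these parameters that maps directly onto a region of space, which is precisely what makes a priori quality assessment difficult.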

While explicit sampled representations admit established ways to assess rendering quality and diagnose failure modes, encoding the entire scene within network weights complicates interpretability. The authors emphasize that, despite NeRF’s strong empirical performance, analytic tools for understanding when and why renderings succeed or fail, and for predicting view quality a priori, are lacking for weight-encoded scene models.

References

Another direction for future work is interpretability: sampled representations such as voxel grids and meshes admit reasoning about the expected quality of rendered views and failure modes, but it is unclear how to analyze these issues when we encode scenes in the weights of a deep neural network.

Mildenhall et al., "NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis," 2020 (arXiv:2003.08934), Conclusion.