- The paper introduces a geometry-free approach for novel view synthesis by leveraging set-latent scene representations with transformer attention.
- It employs an encoder-decoder architecture where CNN feature extraction and transformer networks replace explicit geometric computations.
- Empirical results show improved PSNR and rapid rendering in synthetic and real-world settings, enabling interactive 3D visualization.
Advancements in Novel View Synthesis: Insights from "Scene Representation Transformer"
Introduction
In the domain of computer vision and 3D scene representation, the recent "Scene Representation Transformer" (SRT) marks a notable step forward in novel view synthesis. Rather than relying on explicit geometric reconstruction, SRT synthesizes novel views from a set-latent scene representation built with transformer models. It processes RGB images, with or without known camera poses, and renders new views efficiently. This broadens the scope of interactive visualization and yields clear gains over existing baselines in scalability and rendering speed, on both synthetic datasets and real-world imagery.
Methodology
SRT follows an encoder-decoder design built around transformers. A convolutional neural network (CNN) first distills each input image into patch features; an encoder transformer then fuses these features into a set-latent scene representation. Novel views are produced by a decoder transformer that, for each target ray, attends to the relevant parts of this latent set and predicts a color. The key departure from traditional pipelines is that SRT replaces explicit geometric computation with learned attention, so the whole system is trained end-to-end directly from image data. A minimal sketch of this pipeline is given below.
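The following PyTorch sketch illustrates the encoder-decoder structure under stated assumptions: the layer counts, feature dimensions, patch resolution, and the 6-D origin-plus-direction ray query are illustrative choices rather than the paper's exact configuration, and the standard `nn.TransformerDecoder` used here adds self-attention among queries that the original decoder does not necessarily include.

```python
# Minimal SRT-style sketch: CNN patch features -> encoder transformer ->
# set-latent scene representation -> decoder transformer queried per ray.
# All hyperparameters below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn


class PatchCNN(nn.Module):
    """Downsamples each input image into a grid of patch features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, dim, 3, stride=2, padding=1),
        )

    def forward(self, imgs: torch.Tensor) -> torch.Tensor:
        # imgs: (B, N_views, 3, H, W) -> (B, N_views * H' * W', dim)
        b, n, c, h, w = imgs.shape
        feats = self.net(imgs.flatten(0, 1))              # (B*N, dim, H', W')
        feats = feats.flatten(2).transpose(1, 2)          # (B*N, H'*W', dim)
        return feats.reshape(b, -1, feats.shape[-1])      # pool patches from all views


class SRTSketch(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 8,
                 enc_layers: int = 4, dec_layers: int = 2):
        super().__init__()
        self.cnn = PatchCNN(dim)
        enc_layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, enc_layers)   # builds the set-latent representation
        dec_layer = nn.TransformerDecoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, dec_layers)   # attends ray queries to the latent set
        self.ray_embed = nn.Linear(6, dim)                            # ray origin + direction -> query token
        self.to_rgb = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 3))

    def forward(self, imgs: torch.Tensor, rays: torch.Tensor) -> torch.Tensor:
        # imgs: (B, N_views, 3, H, W); rays: (B, N_rays, 6) for the novel view
        latent = self.encoder(self.cnn(imgs))     # set-latent scene representation
        queries = self.ray_embed(rays)            # one query per target pixel/ray
        pixels = self.decoder(queries, latent)    # cross-attention into the latent set
        return self.to_rgb(pixels)                # (B, N_rays, 3) predicted colors


# Usage: render 1024 rays of a novel view from 5 input images.
model = SRTSketch()
imgs = torch.randn(1, 5, 3, 64, 64)
rays = torch.randn(1, 1024, 6)
rgb = model(imgs, rays)                           # -> torch.Size([1, 1024, 3])
```

Because each output pixel is produced by an independent query into the latent set, rendering can be batched over arbitrary subsets of rays, which is what makes this style of decoder amenable to fast, interactive view synthesis.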
Performance and Evaluations
Evaluations across multiple datasets support SRT's strengths. Compared with recent baselines, it reports higher peak signal-to-noise ratio (PSNR) and substantially faster rendering. In both synthetic environments and more demanding real-world scenes, it handles complex geometry and remains robust to noisy or imprecise camera poses. Because it can operate without camera pose information at inference time, it also opens up applications where view data are sparse or poorly calibrated. In practice, this efficiency translates to interactive rendering speeds, making SRT well suited to tasks that require rapid novel view generation. For reference, the PSNR metric used in these comparisons is defined in the snippet below.
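A small helper showing how PSNR is conventionally computed from the mean squared error between a rendered view and the ground-truth image; the default `max_val` of 1.0 assumes pixel values normalized to [0, 1], which is an assumption of this sketch rather than a detail from the paper.

```python
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio in dB; higher is better."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```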
Theoretical Implications and Future Directions
On a theoretical level, geometry-free scene representation and synthesis demonstrates that transformers can perform 3D reasoning directly in the visual domain. This invites further inquiry into the limits and scalability of neural scene representations, and may guide future research towards more general and efficient models. The effectiveness of set-latent representations also points to applications in virtual reality, augmented reality, and other settings where dynamic view generation is essential.
Conclusion
The "Scene Representation Transformer" heralds a significant leap forward, offering a versatile and efficient method for novel view synthesis without the crutches of pre-defined geometries or exhaustive camera pose requirements. As it sets new benchmarks in terms of speed and scalability, the broader implications for both academic research and practical applications loom large, promising invigorating explorations in the visualization of complex scenes and interactive 3D environments.