Gaussian Masked Autoencoders
The paper introduces Gaussian Masked Autoencoders (GMAE), a novel method for self-supervised image representation learning. GMAE extends the reconstructive objective of Masked Autoencoders (MAE) with a spatially aware component based on Gaussian splatting. The goal is to learn semantic abstractions and spatial understanding jointly, an area where traditional MAE falls short.
GMAE keeps MAE's pixel-space reconstruction target but routes it through a 3D Gaussian-based intermediate representation: rather than predicting masked pixels directly, the decoder predicts a set of Gaussian primitives, which a differentiable splatting renderer converts back into an image for the reconstruction loss. This gives the model an explicit spatial representation, a limitation of conventional MAE.
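To make the rendering step concrete, here is a minimal sketch of splatting in two dimensions: each Gaussian has a center, scale, color, opacity, and a depth used only for compositing order. This is a plain NumPy caricature written for illustration, not the paper's implementation (which splats 3D Gaussians through a differentiable renderer); the function name and parametrization are assumptions.

```python
import numpy as np

def splat_gaussians(centers, scales, colors, alphas, depths, hw=(32, 32)):
    """Alpha-composite isotropic 2D Gaussians, nearest depth first.

    centers: (N, 2) pixel coords as (x=col, y=row); scales: (N,) std devs;
    colors:  (N, 3) RGB in [0, 1]; alphas: (N,) peak opacities;
    depths:  (N,) z values used only to sort the compositing order.
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    order = np.argsort(depths)               # front-to-back
    image = np.zeros((h, w, 3))
    transmittance = np.ones((h, w))          # light not yet absorbed
    for i in order:
        cx, cy = centers[i]
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        a = alphas[i] * np.exp(-0.5 * d2 / scales[i] ** 2)
        image += (transmittance * a)[..., None] * colors[i]
        transmittance *= 1.0 - a
    return image
```

In the real model the renderer is differentiable, so gradients of the pixel-space loss flow back through this compositing step into the predicted Gaussian parameters.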
Quantitatively, GMAE remains competitive with MAE on supervised transfer tasks such as image classification and object detection on benchmarks like ImageNet and COCO. It also improves on MAE in reconstruction fidelity: because the Gaussian representation is non-uniform, primitives can distribute across the image so as to concentrate where there is high-frequency detail.
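As a toy illustration of non-uniform allocation (my own construction, not taken from the paper): sampling Gaussian centers with probability proportional to local gradient magnitude clusters primitives at edges and texture, leaving smooth regions sparsely covered.

```python
import numpy as np

def allocate_centers(image, n, seed=None):
    """Sample n Gaussian centers with probability ~ local gradient magnitude.

    A hand-rolled caricature of the non-uniform allocation GMAE learns:
    more primitives land where the image has high-frequency content.
    Returns (n, 2) integer (row, col) positions.
    """
    rng = np.random.default_rng(seed)
    gy, gx = np.gradient(image.astype(float))
    mag = np.hypot(gx, gy).ravel() + 1e-8    # tiny floor avoids an all-zero pmf
    p = mag / mag.sum()
    flat = rng.choice(mag.size, size=n, p=p)
    return np.stack(np.unravel_index(flat, image.shape), axis=1)
```

On an image that is flat except for one vertical edge, nearly all sampled centers fall on the two columns adjacent to the edge.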
Because each Gaussian carries a depth value, the primitives naturally organize into layers, and this layered representation supports zero-shot tasks such as figure-ground segmentation and edge detection without fine-tuning. Unlike patch-based approaches that sample the image space uniformly, GMAE allocates Gaussian density in response to the semantic content of each region, allowing it to model complex scenes more effectively.
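A hypothetical sketch of how depth layering can yield a figure-ground split: bin the Gaussians by depth quantiles, then rasterize the footprints of the front bin as a binary mask. In GMAE the layering emerges from training rather than from an explicit threshold; the quantile split and both helper names below are assumptions made for illustration.

```python
import numpy as np

def split_layers(depths, n_layers=2):
    """Group Gaussian indices into depth layers by quantile thresholds.

    Returns a list of index arrays, nearest layer first. A crude stand-in
    for the emergent depth ordering that GMAE's Gaussians exhibit.
    """
    qs = np.quantile(depths, np.linspace(0, 1, n_layers + 1))
    qs[-1] = np.inf                      # close the last bin on the right
    return [np.flatnonzero((depths >= lo) & (depths < hi))
            for lo, hi in zip(qs[:-1], qs[1:])]

def figure_mask(centers, scales, idx, hw=(32, 32), thresh=0.5):
    """Binary figure mask from the footprints of the selected Gaussians.

    centers: (N, 2) as (x=col, y=row); idx: indices of the front layer.
    """
    h, w = hw
    ys, xs = np.mgrid[0:h, 0:w]
    resp = np.zeros((h, w))
    for i in idx:
        cx, cy = centers[i]
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        resp = np.maximum(resp, np.exp(-0.5 * d2 / scales[i] ** 2))
    return resp > thresh
```

Rendering only the front layer's footprints gives a figure-ground segmentation with no labels or fine-tuning, which is the spirit of the zero-shot results reported for GMAE.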
The experimental results support GMAE's efficiency: it extends MAE's capabilities with negligible computational overhead, only a 1.5% increase. Importantly, the Gaussians' ability to adapt their size and distribution to image content opens new pathways for spatial reasoning tasks, and it does so robustly even in zero-shot settings.
Theoretically, this research lays the groundwork for a new generation of high-fidelity visual modeling techniques: the Gaussian-based intermediate representation is a concrete step toward bridging low-level pixel data and high-level semantic abstraction. Practically, GMAE's ability to perform spatial reasoning without extensive labeled data opens opportunities in applications that require robust scene understanding. Future work could refine the approach further, for example by integrating more advanced Gaussian rendering techniques and scale optimization to extend its utility across a broader spectrum of complex visual datasets and tasks.