
ShelfGaussian: 3D Scene Understanding

Updated 6 December 2025
  • ShelfGaussian is a Gaussian-based framework that represents 3D scenes as a sparse set of Gaussians, integrating camera, LiDAR, and radar data.
  • It employs a unified transformer architecture with deformable attention to fuse multi-modal features for zero-shot semantic occupancy prediction.
  • The shelf-supervised paradigm leverages off-the-shelf vision models and pseudo-labels to enable efficient, real-time occupancy inference and trajectory planning.

ShelfGaussian is an open-vocabulary, multi-modal, Gaussian-based 3D scene understanding framework that represents three-dimensional scenes with Gaussians as atomic primitives, jointly supervised by off-the-shelf vision foundation models (VFMs). It integrates camera, LiDAR, and radar inputs within a unified transformer-based architecture, enabling state-of-the-art zero-shot semantic occupancy prediction and robust planning for autonomous agents, including in-the-wild field evaluation on unmanned ground vehicles (Zhao et al., 3 Dec 2025). ShelfGaussian also connects to the broader family of shell-Gaussian functions, $\Omega_N(x; \mu, \nu)$, which are invariant under convolution with a Gaussian and are instrumental in modeling spherically symmetric scalar fields (Urzhumtsev et al., 18 Dec 2024).

1. Gaussian-Based 3D Scene Representation

ShelfGaussian represents a 3D scene as a sparse set of Gaussians:

$$G = \{ G_i \}_{i=1}^{N}, \qquad G_i = (\mu_i, \Sigma_i, \alpha_i, f_i),$$

where:

  • $\mu_i \in \mathbb{R}^3$: mean (scene location in the ego-vehicle frame),
  • $\Sigma_i \in \mathbb{R}^{3 \times 3}$: covariance (encoded via rotation quaternion $r_i$ and scale $s_i$),
  • $\alpha_i \in [0,1]$: opacity,
  • $f_i \in \mathbb{R}^{C_f}$: feature vector.
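
The tuple above can be made concrete with a small container. The following is a minimal sketch in Python/NumPy, assuming the usual 3D-Gaussian-splatting parameterization $\Sigma_i = R(r_i)\,\mathrm{diag}(s_i)^2\,R(r_i)^\top$; the class and field names are illustrative, not taken from the released code.

```python
# Hypothetical per-Gaussian parameter container; names and the
# quaternion-to-covariance construction are assumptions for illustration.
from dataclasses import dataclass
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

@dataclass
class Gaussian:
    mu: np.ndarray      # (3,)  mean in the ego-vehicle frame
    r: np.ndarray       # (4,)  rotation quaternion encoding orientation
    s: np.ndarray       # (3,)  per-axis scale
    alpha: float        # opacity in [0, 1]
    f: np.ndarray       # (C_f,) feature vector

    @property
    def sigma(self) -> np.ndarray:
        """Covariance Sigma = R diag(s)^2 R^T."""
        R = quat_to_rotmat(self.r)
        return R @ np.diag(self.s ** 2) @ R.T
```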

Initialization involves projecting pixels with depth into 3D and assigning orientation and scale:

$$m_c^{3D} = (u, v, d)^\top, \qquad m_{ego} = E_{e \rightarrow c}^{-1} K^{-1} m_c^{3D},$$

coupled with an orientation $r_i$ orthogonal to the viewing direction and a scale $s_i \propto d$ (Zhao et al., 3 Dec 2025).
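
A minimal sketch of this back-projection is given below, interpreting the formula above under a standard pinhole model: the pixel $(u, v)$ is lifted with its depth $d$ via $K^{-1}$, then mapped to the ego frame by the inverse ego-to-camera extrinsic. Function and variable names are illustrative assumptions.

```python
# Hedged sketch of pixel back-projection into the ego-vehicle frame.
import numpy as np

def backproject_to_ego(u, v, d, K, E_e2c):
    """Lift pixel (u, v) with depth d into the ego frame.

    K: (3, 3) camera intrinsics; E_e2c: (4, 4) ego-to-camera extrinsic.
    """
    p_cam = d * (np.linalg.inv(K) @ np.array([u, v, 1.0]))   # camera frame
    p_cam_h = np.append(p_cam, 1.0)                           # homogeneous
    p_ego_h = np.linalg.inv(E_e2c) @ p_cam_h                  # ego frame
    return p_ego_h[:3]
```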

Rendering proceeds by sorting Gaussians along a given camera ray and applying alpha-blending:

$$\bar F(p) = \sum_{i=1}^N \bar f_i \alpha_i \prod_{j < i} (1 - \alpha_j), \qquad \bar D(p) = \sum_{i=1}^N d_i \alpha_i \prod_{j < i} (1 - \alpha_j),$$

enabling efficient composition of 2D projections from sparsely parameterized 3D scenes.
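
The alpha-blending above can be sketched per ray as follows, assuming the Gaussians intersected by the ray are already depth-sorted and each contributes an opacity $\alpha_i$, feature $\bar f_i$, and depth $d_i$. This is an illustrative reference implementation, not the paper's renderer.

```python
# Front-to-back alpha compositing along a single camera ray.
import numpy as np

def composite_ray(alphas, feats, depths):
    """Alpha-blend features and depth for one pixel (front-to-back order)."""
    transmittance = 1.0
    F = np.zeros_like(feats[0])
    D = 0.0
    for a, f, d in zip(alphas, feats, depths):
        w = a * transmittance          # weight = alpha_i * prod_{j<i}(1 - alpha_j)
        F += w * f
        D += w * d
        transmittance *= (1.0 - a)
    return F, D
```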

2. Multi-Modal Gaussian Transformer Architecture

ShelfGaussian applies a multi-modal transformer whose learnable Gaussian "queries" are associated with 3D points and designed to interact with features from different sensor modalities:

  • Camera: DINO backbone plus FPN provides multi-scale image features.
  • LiDAR: SECOND + FPN yields BEV features.
  • Radar: PointPillars outputs BEV features.

Cross-modal attention is performed via DeformAttn modules, extracting modality-specific features:

$$Q_c = \mathrm{DeformAttn}(F_c, Q, M_c^{3D}), \quad Q_l = \mathrm{DeformAttn}(F_l, Q, M_l^{3D}), \quad Q_r = \mathrm{DeformAttn}(F_r, Q, M_l^{3D}),$$

with modality fusion via concatenation and MLP. Position encoding, self-attention, and feed-forward processing yield updated queries and parameter refinements for the Gaussians. This transformer design allows ShelfGaussian to exploit complementary sensor sources and update scene parameters in a self-consistent manner (Zhao et al., 3 Dec 2025).
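
A schematic PyTorch sketch of this fusion step is shown below. True deformable attention (sampling image/BEV features at the projected reference points $M^{3D}$) is replaced by ordinary cross-attention as a stand-in, positional encoding is omitted, and the module and dimension names are assumptions rather than the paper's code.

```python
# Schematic sketch of multi-modal query fusion; not the authors' implementation.
import torch
import torch.nn as nn

class GaussianFusion(nn.Module):
    def __init__(self, dim=256, n_heads=8, n_modalities=3):
        super().__init__()
        # One cross-attention block per modality, standing in for DeformAttn.
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(dim, n_heads, batch_first=True)
            for _ in range(n_modalities)
        )
        # Modality fusion via concatenation + MLP.
        self.fuse = nn.Sequential(
            nn.Linear(dim * n_modalities, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, queries, modality_feats):
        # queries: (B, N_gaussians, dim); modality_feats: list of (B, L_m, dim).
        per_modality = [
            attn(queries, feats, feats)[0]     # stand-in for DeformAttn(F_m, Q, M^3D)
            for attn, feats in zip(self.cross_attn, modality_feats)
        ]
        q = self.fuse(torch.cat(per_modality, dim=-1))
        q = q + self.self_attn(q, q, q)[0]     # query self-attention
        return q + self.ffn(q)                 # feed-forward refinement
```

The refined queries would then be decoded into updates of the Gaussian parameters $(\mu_i, r_i, s_i, \alpha_i, f_i)$, as described above.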

3. Shelf-Supervised Learning Paradigm

ShelfGaussian employs a "shelf-supervised" paradigm where 2D vision foundation models and 3D pseudo-labels provide supervision without the need for hand-annotated 3D semantic ground truth. The overall loss consists of 2D image-level and 3D scene-level terms:

$$L = L_\mathrm{feat}^{2D} + L_\mathrm{depth}^{2D} + L_\mathrm{SILog}^{2D} + L_\mathrm{bce}^{3D} + L_\mathrm{feat}^{3D}$$

  • 2D losses: Rendered depth and compressed feature maps are compared with VFM targets using $L_1$, SILog, and cosine-similarity metrics.
  • 3D losses: Pseudo-labels for occupancy and features are derived by projecting LiDAR into the camera images and aggregating DINO features, followed by voxelization. CUDA-accelerated Gaussian-to-Voxel (G2V) splatting enables efficient forward and backward passes (a naive sketch of this operation follows the list).
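
The sketch below is a naive, dense (non-CUDA) interpretation of Gaussian-to-Voxel splatting, assuming it amounts to evaluating each Gaussian's density at voxel centers and accumulating occupancy and features; the real kernel is far more efficient and its exact conventions may differ.

```python
# Dense, illustrative Gaussian-to-Voxel splatting; conventions are assumptions.
import torch

def splat_gaussians_to_voxels(mu, sigma, alpha, feat, voxel_centers):
    """mu: (N, 3), sigma: (N, 3, 3), alpha: (N,), feat: (N, C), voxel_centers: (V, 3)."""
    diff = voxel_centers[None, :, :] - mu[:, None, :]          # (N, V, 3)
    prec = torch.linalg.inv(sigma)                              # (N, 3, 3) precision
    maha = torch.einsum('nvi,nij,nvj->nv', diff, prec, diff)    # squared Mahalanobis
    w = alpha[:, None] * torch.exp(-0.5 * maha)                 # (N, V) splat weights
    occupancy = w.sum(dim=0).clamp(max=1.0)                     # per-voxel density
    features = torch.einsum('nv,nc->vc', w, feat) / (w.sum(0)[:, None] + 1e-6)
    return occupancy, features
```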

Per-voxel predictions (occupancy density and features) are supervised using BCE and cosine losses, gated by mask intersections of predicted, pseudo, and visibility labels (Zhao et al., 3 Dec 2025).
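
A minimal sketch of this gated 3D supervision follows, with BCE on occupancy and a cosine loss on features restricted to the mask intersection; the thresholds, weighting, and exact mask semantics are assumptions for illustration.

```python
# Illustrative gated voxel-level losses; not the paper's exact formulation.
import torch
import torch.nn.functional as F

def voxel_losses(pred_occ, pred_feat, pseudo_occ, pseudo_feat, visibility):
    """pred_occ, pseudo_occ: (V,) in [0, 1]; pred_feat, pseudo_feat: (V, C); visibility: (V,) bool."""
    vis = visibility.bool()
    # Occupancy: BCE restricted to visible voxels.
    l_bce = F.binary_cross_entropy(pred_occ[vis].clamp(1e-6, 1 - 1e-6),
                                   pseudo_occ[vis].float())
    # Features: cosine loss gated by the intersection of predicted, pseudo,
    # and visibility masks (0.5 is an assumed occupancy threshold).
    mask = vis & (pred_occ > 0.5) & (pseudo_occ > 0.5)
    cos = F.cosine_similarity(pred_feat[mask], pseudo_feat[mask], dim=-1)
    l_feat = (1.0 - cos).mean() if mask.any() else pred_feat.sum() * 0.0
    return l_bce + l_feat
```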

4. Evaluation and Empirical Results

ShelfGaussian demonstrates state-of-the-art performance across a range of 3D scene understanding tasks. Semantic occupancy prediction results by sensor modality are summarized below:

| Modality | IoU (%) | mIoU (%) |
|---|---|---|
| Camera only | 63.25 | 19.07 |
| LiDAR + Camera | 69.24 | 21.52 |
| LiDAR + Camera + Radar | 69.45 | 21.78 |

For zero-shot BEV segmentation without ground-truth BEV labels or finetuning:

  • ShelfGaussian-C: 21.1% IoU (vehicle), 6.2% IoU (pedestrian), 38.4% IoU (drivable area).

In trajectory planning, the "Gaussian-Planner" produces a mean open-loop trajectory error (Avg L2) of 0.35 m and a collision rate of 0.32% (BEV-Planner: 0.35 m / 0.42%). Field tests with an unmanned ground vehicle confirm successful occupancy inference on novel, open-vocabulary queries (e.g., "flower shrubs", "stop sign") in diverse real-world scenarios (Zhao et al., 3 Dec 2025).

5. Implementation Details and Ablations

ShelfGaussian backbone choices include DINOv2 ViT-B/14 (378×672) or DINOv3 ViT-B/16 (432×768) with PCA feature compression (1024 → 128 dimensions), 1000 Gaussians per view, and embedding dimension $C_g = 256$. Key findings include:

  • G2V splatting runs at 0.5 s forward / 2.7 s backward for 18k Gaussians with 1024D features, using 4.9 GB GPU memory (versus GaussTR at 14.1 s/596.7 s/22.3 GB).
  • Performance improves with increased number of Gaussians (300 → 1000) and with joint 2D+3D training (2D-only IoU=47.26%, 3D-only≈58.6%, joint=63.25%).
  • DINOv3 yields mixed results depending on agent size.
  • 3D loss upweighting boosts results.

ShelfGaussian supports real-time inference (1000 Gaussians × 256-dimensional features per view) and is modality-agnostic; the authors release code and data for reproducibility (Zhao et al., 3 Dec 2025).
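
The PCA feature compression mentioned above (e.g., 1024-dimensional VFM features reduced to 128 dimensions before supervision) can be sketched as follows; the use of scikit-learn here is an illustrative choice, not the paper's pipeline.

```python
# Illustrative PCA compression of dense VFM features.
import numpy as np
from sklearn.decomposition import PCA

def compress_features(feats: np.ndarray, out_dim: int = 128) -> np.ndarray:
    """feats: (num_tokens, 1024) VFM features -> (num_tokens, out_dim)."""
    pca = PCA(n_components=out_dim)
    return pca.fit_transform(feats)
```

In practice the PCA basis would be fit once on a sample of features and then reused for all training and inference passes.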

6. Mathematical Foundations: Shell-Gaussian Functions

The shell-Gaussian function, denoted $\Omega_N(x; \mu, \nu)$, is defined in $\mathbb{R}^N$ as the convolution of the normalized uniform measure on the $N$-sphere of radius $\mu$ with an isotropic Gaussian of variance $\nu$ (Urzhumtsev et al., 18 Dec 2024):

$$\Omega_N(x; \mu, \nu) = \int_{|y| = \mu} \frac{dS_N(y)}{S_N(\mu)} \, g_N(x-y; \nu),$$

where $g_N(x; \nu)$ is the $N$-dimensional normalized Gaussian and $S_N(\mu)$ is the surface area of the $N$-sphere of radius $\mu$.

Key properties include:

  • Gaussian-convolution-invariance: $\Omega_N(x; \mu, \nu) * g_N(x; \sigma) = \Omega_N(x; \mu, \nu+\sigma)$.
  • Explicit formulas for $N = 1, 2, 3$ (in terms of Bessel and hyperbolic functions).
  • Normalization: $\int_{\mathbb{R}^N}\Omega_N(x;\mu,\nu)\,d^N x = 1$.
  • Scaling: for $\alpha > 0$, $\Omega_N(x; \mu, \nu) = \alpha^N \Omega_N(\alpha x; \alpha \mu, \alpha^2\nu)$.

These functions are instrumental for efficient approximation of spherically-symmetric, oscillatory fields and facilitate convolutional workflows such as modeling atomic contributions or electron density in crystallography.
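
The Gaussian-convolution-invariance property can be checked numerically. The sketch below uses the standard closed form of $\Omega_3$ (equivalent to a hyperbolic-sine expression, consistent with the explicit formulas cited above) and verifies by Monte Carlo sampling that blurring the shell-Gaussian with an additional Gaussian of variance $\sigma$ yields $\Omega_3(x; \mu, \nu+\sigma)$; the constants and bin settings are illustrative.

```python
# Monte Carlo check of the Gaussian-convolution-invariance of Omega_3.
import numpy as np

def omega3(r, mu, nu):
    """Radial profile of the N = 3 shell-Gaussian, Omega_3(|x| = r; mu, nu)."""
    norm = 1.0 / (4.0 * np.pi * mu * r * np.sqrt(2.0 * np.pi * nu))
    return norm * (np.exp(-(r - mu) ** 2 / (2 * nu)) - np.exp(-(r + mu) ** 2 / (2 * nu)))

rng = np.random.default_rng(0)
mu, nu, sigma = 2.0, 0.3, 0.2          # shell radius, initial and extra variances
n = 200_000

# Sample Omega_3(.; mu, nu): uniform points on the sphere of radius mu,
# blurred by Gaussian noise of variance nu; then blur once more with
# variance sigma to emulate convolution with g_3(.; sigma).
dirs = rng.normal(size=(n, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
x = mu * dirs + rng.normal(scale=np.sqrt(nu), size=(n, 3))
x += rng.normal(scale=np.sqrt(sigma), size=(n, 3))

# The radii of the twice-blurred samples should follow 4*pi*r^2 * Omega_3(r; mu, nu + sigma).
hist, edges = np.histogram(np.linalg.norm(x, axis=1), bins=50,
                           range=(0.05, 5.0), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
predicted = 4.0 * np.pi * centers ** 2 * omega3(centers, mu, nu + sigma)
print(np.abs(hist - predicted).max())   # small, up to Monte Carlo and binning error
```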

7. Context, Applications, and Future Directions

ShelfGaussian represents a convergence of theoretical advances in Gaussian-based function approximation and practical multi-modal scene understanding with open-vocabulary supervision. Its framework accommodates flexible sensor input, scales efficiently, and supports gradient-based refinement through analytic derivatives. Key application domains include:

  • Autonomous agent perception and planning (robotics, self-driving),
  • Semantic scene understanding under open-set and zero-shot conditions,
  • Real-time fusion of camera, LiDAR, and radar modalities.

A plausible implication is that the synergy between Gaussian primitives and shelf-supervision from foundation models may further generalize to 4D spatio-temporal or open-vocabulary affordance modeling, given ongoing developments in multi-modal transformers and Gaussian-based function spaces (Zhao et al., 3 Dec 2025, Urzhumtsev et al., 18 Dec 2024).
