
UniGS: Unified 3D Gaussian Splatting

Updated 27 April 2026
  • UniGS is a unified framework that employs anisotropic 3D Gaussian splatting to represent scenes and enable multi-modal image synthesis, reconstruction, retrieval, and segmentation.
  • It integrates Gaussian-aware dual-branch encoders, DETR-style multi-view updating, and differentiable rendering pipelines, and reports state-of-the-art results on several benchmarks.
  • The framework leverages multi-modal contrastive, reconstruction, and perceptual losses, advancing empirical results in 3D scene understanding and generative segmentation.

UniGS encompasses a family of unified frameworks at the intersection of 3D Gaussian splatting, multi-modal pretraining, and generative segmentation, each using 3D Gaussian representations as the principal abstraction for image synthesis, reconstruction, retrieval, and segmentation across diverse modalities. Recent works designated as "UniGS" converge on a set of core technical contributions: unified representation of scenes via anisotropic 3D Gaussians; differentiable renderers supporting multiple output modalities (RGB, depth, normal, semantics); and joint training protocols incorporating image, text, and 3D data streams. Across the literature, UniGS variants report state-of-the-art results on benchmarks for 3D scene understanding, language-aligned retrieval, and segmentation, while introducing architectural innovations such as modular Gaussian-aware encoders, DETR-like updating strategies, and cross-modal alignment losses.

1. Unified 3D Gaussian Splatting Representation

At the core of all UniGS models is the representation of 3D scenes or entities via a set of $N$ anisotropic Gaussians,

$$G = \{\,g_i\}_{i=1}^N,\quad g_i = (\mu_i,\,c_i,\,\alpha_i,\,s_i,\,R_i)$$

where $\mu_i$ denotes the Gaussian center in $\mathbb R^3$, $c_i$ the color coefficients (often via spherical harmonics), $\alpha_i$ the opacity, $s_i \in \mathbb R^3_+$ the scaling parameters, and $R_i$ a rotation, stored as a quaternion and used in matrix form in the covariance. The covariance is parameterized as $\Sigma_i = R_i\,\mathrm{diag}(s_i)^2\,R_i^\top$. The contribution of Gaussian $i$ at a spatial query $\mathbf x$ is modeled by the unnormalized density

$$G_i(\mathbf x) = \exp\!\left(-\tfrac{1}{2}\,(\mathbf x - \mu_i)^\top \Sigma_i^{-1}\,(\mathbf x - \mu_i)\right)$$

enabling smooth, differentiable rendering. Opacity-weighted splatting combines per-Gaussian contributions, sorted front to back along the viewing ray, to obtain the final pixel color:

$$C = \sum_{i=1}^{N} c_i\,\alpha_i' \prod_{j<i} \bigl(1 - \alpha_j'\bigr)$$

where $\alpha_i'$ is the opacity $\alpha_i$ weighted by the projected Gaussian's value at the pixel. This formulation underpins both multi-view synthesis (Wu et al., 2024) and unified scene rendering and segmentation (Li et al., 25 Feb 2025, Xie et al., 14 Oct 2025).
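The covariance parameterization and opacity-weighted compositing described above can be illustrated with a minimal NumPy sketch. It assumes a (w, x, y, z) quaternion convention and per-pixel opacities that have already been weighted by the projected Gaussian and sorted near to far; the function names are hypothetical and do not come from any UniGS codebase.

```python
import numpy as np

def quat_to_rotmat(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix (assumed convention)."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])

def covariance(q, s):
    """Sigma = R diag(s)^2 R^T for quaternion q and per-axis scales s."""
    R = quat_to_rotmat(np.asarray(q, dtype=float))
    S = np.diag(np.asarray(s, dtype=float))
    return R @ S @ S @ R.T

def gaussian_weight(x, mu, sigma):
    """Unnormalized Gaussian density exp(-0.5 * d^T Sigma^-1 d) with d = x - mu."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma, d)))

def composite(colors, alphas):
    """Front-to-back alpha compositing; inputs are assumed sorted near to far."""
    pixel = np.zeros(3)
    transmittance = 1.0
    for c, a in zip(colors, alphas):
        pixel += transmittance * a * np.asarray(c, dtype=float)
        transmittance *= 1.0 - a
    return pixel
```

For example, `covariance([1, 0, 0, 0], [1, 2, 3])` yields a diagonal matrix with the squared scales on the diagonal, since the identity quaternion applies no rotation.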

2. Model Architectures and Gaussian-Aware Guidance

Several architectural variants of UniGS adapt this representation to different learning tasks:

  • Gaussian-Aware Dual-Branch Encoders: In multi-modal pretraining (Li et al., 25 Feb 2025), UniGS incorporates two branches within the 3D encoder: a fundamental branch for geometry and color, and an advanced branch for shape, opacity, and pose. Cross-attention between the two branches injects geometric priors, producing aligned 512-dimensional embeddings for unified comparison with image and text features.
  • DETR-Style Multi-View Updaters: For novel-view scene reconstruction (Wu et al., 2024), UniGS models a single set of unitary world-space Gaussians and updates their parameters across a stack of transformer layers with multi-view cross-attention (MVDFA) and spatially efficient inter-Gaussian self-attention (SESA). Each Gaussian is treated as a query, integrating multi-view evidence without redundant representation.
  • Differentiable Splatting Pipelines: For multimodal rendering (Xie et al., 14 Oct 2025), UniGS extends traditional splatting to simultaneous production of RGB, depth, normal, and semantic outputs via CUDA-accelerated routines. Depth is computed by analytic ray–ellipsoid intersection, with gradients backpropagated to all Gaussian attributes.
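As an illustration of the analytic ray–ellipsoid intersection mentioned for depth, the following NumPy sketch solves the quadratic obtained by substituting a ray into the ellipsoid equation $(\mathbf x - \mu)^\top \Sigma^{-1} (\mathbf x - \mu) = \ell^2$. The iso-surface level $\ell$ (`level`) and the function name are assumptions made for this example; the source does not state which level set is used.

```python
import numpy as np

def ray_ellipsoid_depth(origin, direction, mu, sigma, level=1.0):
    """Nearest t >= 0 where origin + t * direction hits the ellipsoid
    (x - mu)^T Sigma^-1 (x - mu) = level^2, or None if the ray misses.
    `level` is a hypothetical iso-surface parameter."""
    o = np.asarray(origin, dtype=float) - np.asarray(mu, dtype=float)
    d = np.asarray(direction, dtype=float)
    sigma_inv = np.linalg.inv(np.asarray(sigma, dtype=float))
    # Quadratic a t^2 + b t + c = 0 from substituting the ray into the ellipsoid.
    a = d @ sigma_inv @ d
    b = 2.0 * (d @ sigma_inv @ o)
    c = o @ sigma_inv @ o - level ** 2
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None  # the ray does not touch the ellipsoid
    root = np.sqrt(disc)
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t >= 0:
            return float(t)  # first intersection in front of the origin
    return None
```

With a unit `direction`, the returned `t` is the distance along the ray. Because `t` is a closed-form function of `mu` and `sigma`, an autodiff framework could propagate gradients to the Gaussian parameters, which is the property the text attributes to the analytic formulation.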

A modular, learnable pruning attribute (one per Gaussian) supports differentiable filtering of non-contributing components, reducing the representation size by approximately 17% with negligible quality loss (Xie et al., 14 Oct 2025).
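One way such a per-Gaussian pruning attribute could work is as a soft gate on opacity during training, followed by a hard threshold afterwards. The sketch below is only an illustration under that assumption: the sigmoid gating, the `threshold` value, and the function names are hypothetical and are not taken from Xie et al.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def apply_pruning(opacity, prune_logit, threshold=0.5):
    """Gate each Gaussian's opacity with a learnable logit (hypothetical scheme).

    Returns the softly gated opacities (usable during training, since the
    gate is a smooth function of the logit) and a boolean mask of the
    Gaussians that would be kept after hard thresholding.
    """
    gate = sigmoid(np.asarray(prune_logit, dtype=float))
    effective_opacity = np.asarray(opacity, dtype=float) * gate
    keep = gate >= threshold
    return effective_opacity, keep

opacity = np.array([0.9, 0.8, 0.7])
logits = np.array([-4.0, 0.0, 4.0])
effective, keep = apply_pruning(opacity, logits)
```

In this toy input, the first Gaussian has a strongly negative logit and is dropped by the mask, while the other two are retained.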

3. Learning Objectives and Cross-Modal Alignment

All UniGS models are trained with objectives integrating reconstruction, cross-modal alignment, and geometric consistency:

  • Multi-Modal Contrastive Losses: For language-image-3D alignment, UniGS uses CLIP-style symmetric contrastive losses between image, text, and 3DGS embeddings. Within a batch, each matched pair of embeddings is treated as a positive and all other pairings as negatives; the loss is computed for one modality pair and analogously for the other, and the overall loss combines these pairwise terms (Li et al., 25 Feb 2025).

  • Reconstruction Losses: For rendering, a pixel-wise distance term and SSIM are used for RGB, a pixel-wise distance term for depth, cosine error for normals, and softmax cross-entropy for semantic logits (Xie et al., 14 Oct 2025).
  • Perceptual Objectives: In view-synthesis (Wu et al., 2024), UniGS combines MSE and LPIPS losses over rendered and ground-truth images.
  • Auxiliary Losses: Pruning agents and multimodal gradients are regularized, e.g., with a penalty on gradient-factor maps (Xie et al., 14 Oct 2025).
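The CLIP-style symmetric contrastive objective in the first item above can be sketched in NumPy as follows. This is a generic InfoNCE formulation, not the exact loss of Li et al.; the temperature value and the function names are assumptions for illustration.

```python
import numpy as np

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings.

    Row i of z_a and row i of z_b form the positive pair; every other
    pairing in the batch acts as a negative. `temperature` is an assumed value.
    """
    z_a, z_b = l2_normalize(z_a), l2_normalize(z_b)
    logits = z_a @ z_b.T / temperature
    idx = np.arange(len(z_a))
    loss_a_to_b = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_b_to_a = -log_softmax(logits, axis=0)[idx, idx].mean()
    return float(0.5 * (loss_a_to_b + loss_b_to_a))

def total_alignment_loss(z_3d, z_img, z_txt):
    """Sum of the 3D-image and 3D-text terms (an assumed combination)."""
    return symmetric_contrastive(z_3d, z_img) + symmetric_contrastive(z_3d, z_txt)
```

The loss is near zero when matched embeddings coincide and are orthogonal to all others, and grows when the pairing within the batch is scrambled.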

Curriculum or progressive task-scheduling strategies are often used, with easier tasks emphasized early in training (Ni et al., 28 Nov 2025).
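The per-modality reconstruction terms listed above can be written compactly in NumPy. The sketch below uses an L1 distance for RGB and depth, which is an assumption, and omits SSIM for brevity; the function names are hypothetical.

```python
import numpy as np

def l1_loss(pred, target):
    """Mean absolute error; assumed here as the pixel-wise RGB/depth term."""
    return float(np.abs(pred - target).mean())

def cosine_normal_error(pred, target, eps=1e-8):
    """Mean of 1 - cos(angle) between predicted and reference normals, shape (H, W, 3)."""
    dot = (pred * target).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    return float((1.0 - dot / (norms + eps)).mean())

def softmax_cross_entropy(logits, labels):
    """Per-pixel cross-entropy; logits (H, W, C), integer labels (H, W)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    picked = np.take_along_axis(log_probs, labels[..., None], axis=-1)
    return float(-picked.mean())
```

A weighted sum of these terms would form the overall rendering loss; the weights are not given in the text and would need to be chosen per task.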

4. Applications in 3D Scene Understanding, Synthesis, and Retrieval

UniGS demonstrates substantial empirical gains across several benchmarks:

  • Text-Driven 3D Retrieval: On Objaverse, UniGS achieves 21.0% Top-1 accuracy vs. 10.4% for Uni3D; on ABO, 11.27% vs. 8.07%; and on MVImgNet, 18.65% vs. 9.09% (Li et al., 25 Feb 2025).
  • Zero-Shot 3D Classification: UniGS outperforms Uni3D by +8.10 percentage points Top-1 on Objaverse, +9.18 on ABO, and +7.92 on SUN RGB-D scene recognition tasks (Li et al., 25 Feb 2025).
  • Novel View Synthesis: By modeling a unitary set of scene Gaussians and jointly updating via multi-view cross-attention, UniGS attains PSNR = 30.42 dB (↑ 4.2 dB over prior SOTA) and LPIPS = 0.042 on the Objaverse→GSO benchmark (Wu et al., 2024).
  • Multimodal Rendering: UniGS enables real-time rendering (~170 FPS) of RGB, depth, normals, and semantic maps, with leading accuracy: PSNR = 30.1 dB, AbsRel depth = 0.0062, normal cosine = 0.888, semantic mIoU = 0.987 (Xie et al., 14 Oct 2025).

The frameworks sustain their accuracy for arbitrary input view counts without memory explosion and provide fine-grained scene reconstructions with fewer artifacts such as ghosting.
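For reference, the metrics quoted in this section follow standard definitions, sketched below in NumPy. The function names are illustrative, and details such as the valid-depth mask or the image value range are assumptions that vary between benchmarks.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((pred - target) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def abs_rel(pred_depth, gt_depth):
    """Mean absolute relative depth error over pixels with positive ground truth."""
    valid = gt_depth > 0
    return float(np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid]))

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union over classes present in pred or gt."""
    ious = []
    for k in range(num_classes):
        inter = np.logical_and(pred == k, gt == k).sum()
        union = np.logical_or(pred == k, gt == k).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

def top1_retrieval(query_emb, gallery_emb):
    """Fraction of queries whose most similar gallery item is the matching index."""
    sims = query_emb @ gallery_emb.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(query_emb))))
```

Higher is better for PSNR, mIoU, and Top-1 accuracy, while lower is better for AbsRel.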

5. Extensions to Generative Segmentation

A concurrent strand adapts "UniGS" to unified image generation and entity segmentation within latent diffusion models (Qi et al., 2023). Here, instance masks are encoded as colormaps using a location-aware palette, enabling simultaneous image and instance-mask generation from a single model. The progressive dichotomy module decodes colormaps into a variable number of entity masks via recursive 2-means clustering in feature space, removing the need to specify the number of instances in advance. Ablations indicate that the location-aware palette and recursive clustering outperform random-palette and fixed-cluster-count alternatives in mIoU (0.808 vs. 0.493/0.520).
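The idea of recursive 2-means clustering can be illustrated with a short NumPy sketch that keeps bisecting a set of feature vectors until the two resulting centers are close together. The stopping rule (`min_gap`), the deterministic initialization, and the function names are assumptions for this example; the source does not describe the module's actual stopping criterion.

```python
import numpy as np

def two_means(points, iters=20):
    """Split points (n, d) into two clusters with Lloyd's algorithm."""
    # Deterministic init: the point farthest from the mean, then the point farthest from it.
    c0 = points[np.argmax(np.linalg.norm(points - points.mean(axis=0), axis=1))]
    c1 = points[np.argmax(np.linalg.norm(points - c0, axis=1))]
    centers = np.stack([c0, c1]).astype(float)
    for _ in range(iters):
        dists = np.linalg.norm(points[:, None, :] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = points[labels == k].mean(axis=0)
    return labels, centers

def dichotomy(points, indices=None, min_gap=0.5):
    """Recursively bisect; stop when the two centers are closer than min_gap (assumed rule)."""
    if indices is None:
        indices = np.arange(len(points))
    if len(indices) < 2:
        return [indices]
    labels, centers = two_means(points[indices])
    too_close = np.linalg.norm(centers[0] - centers[1]) < min_gap
    if too_close or labels.min() == labels.max():
        return [indices]
    return (dichotomy(points, indices[labels == 0], min_gap)
            + dichotomy(points, indices[labels == 1], min_gap))

# Toy 2D "features" forming three well-separated groups.
features = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
groups = dichotomy(features)
```

The number of groups is never passed in: it emerges from how many times the recursion finds a split worth making, which mirrors the motivation given in the text for not fixing the instance count.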

Empirical results show that UniGS diffusion models achieve FID = 15.27 and mIoU = 0.781 for single-object image synthesis, outperforming Stable Diffusion and ControlNet baselines. In referring segmentation, UniGS approaches the accuracy of specialized segmenters (Mask2Former mIoU = 0.815, UniGS mIoU = 0.808).

6. Limitations and Directions for Future Research

Several limitations are shared across UniGS modalities:

A plausible implication is that ongoing development may focus on scaling memory bank capacity, multi-sensor integration, and tighter language-3D alignment using LLMs and external knowledge sources.

7. Comparative Summary

The following table provides a high-level overview of prominent UniGS instantiations:

| Paper (citation) | Core modality | Key innovations |
|---|---|---|
| "Unified Language-Image-3D Pretraining ..." (Li et al., 25 Feb 2025) | Vision-language-3D | 3DGS + CLIP, dual-branch guidance, multi-modal pretraining |
| "Unified Geometry-Aware Gaussian Splatting ..." (Xie et al., 14 Oct 2025) | Multi-modal 3D reconstruction | RGB/depth/normal/semantic output, analytic gradients, pruning |
| "Modeling Unitary 3D Gaussians ..." (Wu et al., 2024) | Sparse-view NVS | DETR-like multi-view update, ghosting elimination |
| "Unified Representation for Image Generation..." (Qi et al., 2023) | Generative segmentation | Latent diffusion, colormaps, progressive dichotomy module |
| "UniGeoSeg: Unified Open-World Segmentation..." (Ni et al., 28 Nov 2025) | RS segmentation | Unified instruction-driven segmentation, LLM-guided pipeline |

Each instantiation reports state-of-the-art results in its respective setting, supported by both quantitative metrics and qualitative comparisons. As these architectural and training ideas spread through the field, UniGS is positioned as a common abstraction for unified, differentiable scene understanding across modalities.
