
PocketGS: On-Device 3D Gaussian Splatting

Updated 27 January 2026
  • PocketGS is a mobile 3D Gaussian Splatting system that delivers high-fidelity, real-time scene modeling through resource-efficient, fully on-device training.
  • It co-designs three operators to meet stringent mobile constraints: 𝒢 for geometry-prior construction, ℐ for anisotropic Gaussian initialization, and 𝒯 for hardware-aligned differentiable optimization.
  • Empirical evaluations show that PocketGS matches or surpasses workstation 3DGS pipelines in perceptual quality under an equal iteration budget while keeping peak memory below 3 GB.

PocketGS is a fully on-device 3D Gaussian Splatting (3DGS) training system designed to enable high-fidelity, efficient 3D scene modeling directly on resource-constrained mobile devices such as smartphones. By jointly addressing the stringent requirements of minute-scale training budgets, strict peak-memory caps (below 3 GB), and hardware-accelerated differentiable optimization, PocketGS delivers real-time novel-view synthesis that matches or surpasses workstation-grade 3DGS pipelines in perceptual fidelity. The paradigm is underpinned by three co-designed operators: 𝒢 for geometry-prior construction, ℐ for prior-conditioned anisotropic Gaussian initialization, and 𝒯 for hardware-aligned differentiable optimization. This design enables capture-to-rendering workflows entirely on-device, as substantiated by empirical comparisons and ablation studies (Guo et al., 24 Jan 2026).

1. On-Device 3DGS: Motivations and Constraints

PocketGS directly targets the limitations of executing 3DGS training on mobile devices, which differ markedly from desktop environments where memory, compute, and time budgets are largely unconstrained. On devices such as the iPhone 15, PocketGS achieves:

  • End-to-end training within 5 minutes (500 iterations ≈ 4 minutes on Apple A16).
  • Peak memory usage under 3 GB, covering both geometry prior formation and parameter optimization.
  • Correct backpropagation of gradients solely through GPU-side operations, precluding expensive CPU-GPU synchronizations.

Naive application of desktop 3DGS fails under these constraints, confronting three primary contradictions:

  • Input-Recovery Contradiction: Mobile RGB-D scans yield noisy, sparse geometric inputs; naively densifying these within the training loop inflates computational and memory demands.
  • Initialization-Convergence Contradiction: Isotropic Gaussian seeding requires excessive iterations to organize primitives onto scene surfaces—unacceptable when runtime is tightly bounded.
  • Hardware-Differentiability Contradiction: Tile-based deferred rendering on mobile GPUs obscures blending state, thus hindering correct and efficient backpropagation without prohibitive overhead.

PocketGS resolves these issues through the co-design of 𝒢, ℐ, and 𝒯, aligning algorithmic and hardware constraints.

2. Operator 𝒢: Geometry-Prior Construction

Operator 𝒢 synthesizes a geometry-faithful, memory-efficient dense point-cloud prior $\mathcal{P}$ to initialize the scene, comprising the following subsystems:

2.1 Information-Gated Frame Subsampling

To reduce processing load without sacrificing geometric diversity, PocketGS selects keyframes by:

  • Displacement gate: admitting frames whose translation change satisfies $d = \|t_{\text{curr}} - t_{\text{last}}\|_2 \geq \tau_d$, with $\tau_d = 5$ cm.
  • Sharpness gate: using the approximate gradient energy $S = \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \left(|I(x+\Delta, y) - I(x, y)| + |I(x, y+\Delta) - I(x, y)|\right)$.
  • Windowing: within every 8-frame window, a candidate frame replaces the current best only if $S_{\text{new}} > (1 + r) S_{\text{best}}$, with $r = 0.05$.

This gating strictly bounds both the bundle adjustment (BA) and multi-view stereo (MVS) workloads, admitting only 8–15 keyframes for typical captures.
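As an illustration, the two gates and the windowing rule can be sketched in Python. The thresholds $\tau_d = 5$ cm and $r = 0.05$ come from the paper; the grayscale-image representation and the window bookkeeping are simplifying assumptions.

```python
import math

TAU_D = 0.05     # displacement gate threshold: 5 cm (from the paper)
R_MARGIN = 0.05  # sharpness replacement margin r (from the paper)
WINDOW = 8       # frames per window (from the paper)

def sharpness(img, delta=1):
    """Approximate gradient energy S over a grayscale 2-D list."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - delta):
        for x in range(w - delta):
            total += abs(img[y][x + delta] - img[y][x]) + abs(img[y + delta][x] - img[y][x])
            count += 1
    return total / count if count else 0.0

def select_keyframes(frames):
    """frames: list of (position (x, y, z), image). Returns indices of admitted keyframes."""
    keep = []
    last_pos = None        # position of the last admitted keyframe
    best_in_window = None  # (index, sharpness) of current window's best candidate
    for i, (pos, img) in enumerate(frames):
        # Displacement gate: skip frames that moved less than TAU_D since the last keyframe.
        if last_pos is not None and math.dist(pos, last_pos) < TAU_D:
            continue
        s = sharpness(img)
        # Windowing: replace the current best only on a clear (1 + r) sharpness gain.
        if best_in_window is None or s > (1 + R_MARGIN) * best_in_window[1]:
            best_in_window = (i, s)
        if (i + 1) % WINDOW == 0 and best_in_window is not None:
            keep.append(best_in_window[0])
            last_pos = frames[best_in_window[0]][0]
            best_in_window = None
    return keep
```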

2.2 GPU-Native Global Bundle Adjustment

Refinement of the ARKit poses $\{\hat{T}_t\}$ and sparse points $\{P_j\}$ employs a robust, fully GPU-based Schur-complement solver to minimize the Huber-robustified reprojection loss:

$\min_{\{T_i\}, \{P_j\}} \sum_{i, j} \rho\left(\|\pi(T_i, P_j) - p_{ij}\|^2_{\Sigma_{ij}}\right)$

Here, $\pi$ denotes the camera projection and $\rho$ the Huber loss. The Hessian blocks are partitioned and inverted in parallel, avoiding CPU-GPU round-trips and yielding a clean, high-precision sparse point cloud.
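The robustifier $\rho$ can be sketched as the standard Huber loss applied to a squared residual $s = \|r\|^2$, following the convention used by solvers such as Ceres; the threshold `delta` below is an illustrative default, not a value from the paper.

```python
import math

def huber(s, delta=1.0):
    """Huber robustifier rho(s) on a squared residual s = ||r||^2:
    identity (quadratic in r) below delta, linear growth in r beyond it."""
    r = math.sqrt(s)
    if r <= delta:
        return s
    return 2.0 * delta * r - delta * delta

def huber_weight(s, delta=1.0):
    """IRLS weight rho'(s): 1 for inliers, down-weighted past the threshold.
    Large reprojection errors thus contribute bounded gradient magnitude."""
    r = math.sqrt(s)
    return 1.0 if r <= delta else delta / r
```

In an iteratively reweighted least-squares view, each residual block in the BA normal equations is simply scaled by `huber_weight` before the Schur elimination.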

2.3 Single-Reference Cost-Volume MVS

Dense reconstruction relies on a single-reference plane-sweep MVS, selecting the optimal reference frame to maximize:

$S_{\mathrm{ref}} = \exp\left(-\frac{(b - b_\mathrm{target})^2}{2\sigma_b^2}\right) \max\left(\frac{\alpha}{\alpha_\min}, 1\right)$

where $b$ is the baseline and $\alpha$ the viewing angle; depths are sampled between the $5\%$ and $95\%$ quantiles of the existing sparse depths. Census-transform matching costs with Semi-Global Matching produce depth maps, which are fused into $\mathcal{P}$ wherever per-pixel confidence exceeds $0.4$.
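A minimal sketch of the reference-frame score; the parameter values `b_target`, `sigma_b`, and `alpha_min` are hypothetical placeholders, as the paper does not state them.

```python
import math

def ref_score(b, alpha, b_target=0.15, sigma_b=0.05, alpha_min=5.0):
    """Reference-frame score: a Gaussian preference around the target baseline,
    scaled by a viewing-angle factor that never drops below 1.
    b_target, sigma_b, alpha_min are illustrative values, not from the paper."""
    baseline_term = math.exp(-((b - b_target) ** 2) / (2.0 * sigma_b ** 2))
    angle_term = max(alpha / alpha_min, 1.0)
    return baseline_term * angle_term
```

The frame maximizing this score becomes the single plane-sweep reference, so MVS cost-volume memory stays constant regardless of how many keyframes were admitted.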

3. Operator ℐ: Prior-Conditioned Gaussian Initialization

Operator ℐ addresses initialization-convergence inefficiency by directly embedding local surface statistics into each Gaussian's parameterization:

3.1 Local Covariance and Normal Estimation

For each point $p_i$ in $\mathcal{P}$, the covariance $C_i$ is computed from its $K = 16$ nearest neighbors:

$C_i = \frac{1}{K}\sum_{k} (p_k - \bar{p}_i)(p_k - \bar{p}_i)^T$

The eigenvector associated with the smallest eigenvalue of $C_i$ yields the surface normal $n_i$ at $p_i$.
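The neighborhood covariance and its smallest-eigenvalue eigenvector can be computed without a linear-algebra library; the shifted power iteration below is an illustrative substitute for the closed-form 3×3 eigendecomposition a real implementation would use.

```python
import math

def covariance(points):
    """3x3 covariance of a list of 3-D points (the K-NN neighborhood of p_i)."""
    k = len(points)
    mean = [sum(p[j] for p in points) / k for j in range(3)]
    c = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[j] - mean[j] for j in range(3)]
        for a in range(3):
            for b in range(3):
                c[a][b] += d[a] * d[b] / k
    return c

def smallest_eigvec(c, iters=200):
    """Eigenvector of the smallest eigenvalue of symmetric 3x3 c, via power
    iteration on (tr(c) * I - c); the shift reverses the eigenvalue order,
    so the dominant direction of the shifted matrix is the surface normal."""
    t = c[0][0] + c[1][1] + c[2][2]
    m = [[(t if a == b else 0.0) - c[a][b] for b in range(3)] for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(m[a][b] * v[b] for b in range(3)) for a in range(3)]
        n = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / n for x in w]
    return v
```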

3.2 Disc-Like Covariance Seeding

Tangential and normal scales are calculated as:

$s_t = \frac{1}{3}\sum_{k=1}^3 \|p_i - p_k\|, \qquad s_n = r_{\text{normal}} \cdot s_t, \qquad r_{\text{normal}} = 0.3$

The Gaussian covariance is parameterized as:

$\Sigma_i = Q_i\,\mathrm{diag}(s_t^2,\,s_t^2,\,s_n^2)\,Q_i^T$

where $Q_i$ is the rotation aligning the local $z$-axis to $n_i$. Opacity logits are initialized to $\mathrm{logit}(0.1)$. All scales are optimized in log space for numerical stability. This anisotropic, surface-aligned seeding dramatically reduces convergence time and conditions the model for rapid high-fidelity reconstruction.
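A sketch of the disc-like seeding for a single point, with scales returned in log space as described; the helper names are hypothetical.

```python
import math

R_NORMAL = 0.3  # normal-to-tangential scale ratio r_normal (from the paper)

def disc_scales(p, three_nn):
    """Tangential scale = mean distance to the 3 nearest neighbors;
    normal scale = R_NORMAL * tangential scale. Returned as log-scales,
    since the scales are optimized in log space."""
    s_t = sum(math.dist(p, q) for q in three_nn) / 3.0
    s_n = R_NORMAL * s_t
    return math.log(s_t), math.log(s_n)

def opacity_logit(alpha0=0.1):
    """Initial opacity parameter logit(0.1)."""
    return math.log(alpha0 / (1.0 - alpha0))
```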

4. Operator 𝒯: Hardware-Aligned Differentiable Splatting

Operator 𝒯 enables correct and efficient differentiable rendering on tile-based mobile GPUs:

4.1 Unrolled Alpha-Compositing with Forward Replay Cache

Manually unrolling the front-to-back alpha compositing equation,

$C_{\text{out}} = C_{\text{in}} (1 - \alpha) + \alpha c$

PocketGS stores a minimal per-pixel replay cache $S = \{C_{\text{in}}, \alpha\}$ and an $O(WH)$ counter buffer, permitting correct gradient computation:

$\frac{\partial L}{\partial C_{\text{in}}} = \frac{\partial L}{\partial C_{\text{out}}} \cdot (1 - \alpha), \qquad \frac{\partial L}{\partial \alpha} = \frac{\partial L}{\partial C_{\text{out}}} \cdot (c - C_{\text{in}})$

This obviates the need for full splat lists or framebuffer readbacks.
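The unrolled compositing and its replay-cache backward pass can be sketched for a single pixel with scalar color; the per-tile bookkeeping of the on-device implementation is omitted.

```python
def composite_forward(splats, c_bg=0.0):
    """Blend splats [(c, alpha), ...] with C_out = C_in * (1 - alpha) + alpha * c,
    caching (C_in, alpha) at each step for the backward pass."""
    cache = []
    c_acc = c_bg
    for c, a in splats:
        cache.append((c_acc, a))
        c_acc = c_acc * (1.0 - a) + a * c
    return c_acc, cache

def composite_backward(dL_dCout, splats, cache):
    """Replay the cache in reverse blend order; returns per-splat (dL/dc, dL/dalpha)."""
    grads = [None] * len(splats)
    g = dL_dCout
    for i in reversed(range(len(splats))):
        c_in, a = cache[i]
        c, _ = splats[i]
        grads[i] = (g * a, g * (c - c_in))  # dL/dc and dL/dalpha for splat i
        g = g * (1.0 - a)                   # dL/dC_in flows to earlier blend steps
    return grads
```

Note that the backward pass needs only the cached `(C_in, alpha)` pairs and the incoming gradient, matching the claim that no full splat lists or framebuffer readbacks are required.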

4.2 Index-Mapped Gradient Scattering

Rendering requires the parameter vectors $\theta$ to be arranged by depth. Gaussians are sorted on-GPU, producing an index mapping $\pi$. Parameters are gathered in depth order during the forward pass, and gradients are scattered back through $\pi$ in the backward pass:

$\nabla\theta_{\pi(i)} \mathrel{+}= g_i$

Index mapping preserves optimizer state alignment and allows seamless backpropagation without CPU intervention.
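A minimal CPU sketch of the gather/scatter bookkeeping; on-device this runs as GPU kernels over the sorted index buffer.

```python
def sort_by_depth(depths):
    """Depth sort producing the index mapping pi:
    pi[i] is the original index of the i-th nearest Gaussian."""
    return sorted(range(len(depths)), key=lambda i: depths[i])

def gather(params, pi):
    """Arrange parameters in depth order for the forward pass."""
    return [params[j] for j in pi]

def scatter_grads(pi, sorted_grads, n):
    """Scatter depth-ordered gradients g_i back to original slots pi[i],
    so gradients line up with the (unsorted) optimizer state."""
    grads = [0.0] * n
    for i, g in enumerate(sorted_grads):
        grads[pi[i]] += g
    return grads
```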

4.3 On-GPU Adam Updates

All Adam optimizer moments and parameter updates execute within a single GPU kernel, eliminating host-device synchronization costs. For numerical stability, opacity is updated in logit space, scales in log space, and rotations in tangent space, with arithmetic carried out in FP16.
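The fused update can be illustrated with a plain reference implementation of one Adam step per parameter; the hyperparameter defaults are the standard Adam values, not stated by the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a flat parameter list; on-device, this whole loop
    is a single GPU kernel touching each parameter and its two moments once.
    t is the 1-based step count used for bias correction."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = b1 * mi + (1 - b1) * g          # first-moment EMA
        vi = b2 * vi + (1 - b2) * g * g      # second-moment EMA
        m_hat = mi / (1 - b1 ** t)           # bias-corrected moments
        v_hat = vi / (1 - b2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```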

5. End-to-End Mobile Capture-to-Rendering Workflow

The PocketGS workflow comprises:

  1. Capture ~50 seconds of video with ARKit pose tracking.
  2. Information-gated keyframe selection (typically 8–15 frames).
  3. GPU-native global BA to refine camera trajectories and the initial sparse points.
  4. Single-reference MVS to produce a dense point cloud $\mathcal{P}$.
  5. ℐ-based initialization to seed surface-aligned, anisotropic Gaussians ($\Theta_0$).
  6. Adam-optimized differentiable splatting (𝒯) for 500 iterations on-device.
  7. Real-time rendering of the final Gaussian scene on the mobile device.

A companion mobile application implements the pipeline in Swift+Metal. All operations, including BA, MVS, initialization, and training, are fully on-device.

6. Experimental Results and Comparative Analysis

PocketGS is empirically benchmarked against two workstation 3DGS pipelines:

  • 3DGS-SFM-WK: COLMAP SfM sparse prior + vanilla 3DGS.
  • 3DGS-MVS-WK: COLMAP dense prior + vanilla 3DGS.

All systems are restricted to a 500-iteration budget and equivalent resolution. The following table summarizes performance on LLFF, NeRF-Synthetic, and MobileScan datasets (Guo et al., 24 Jan 2026):

| Dataset | Method | PSNR↑ | SSIM↑ | LPIPS↓ | Time (s) | #Gaussians |
|---|---|---|---|---|---|---|
| LLFF | 3DGS-SFM-WK | 21.01 | 0.641 | 0.405 | 108.0 | 18k |
| LLFF | 3DGS-MVS-WK | 19.53 | 0.637 | 0.387 | 313.1 | 40k |
| LLFF | PocketGS | 23.54 | 0.791 | 0.222 | 105.4 | 33k |
| NeRF-Synthetic | 3DGS-SFM-WK | 21.75 | 0.800 | 0.243 | 83.7 | 12k |
| NeRF-Synthetic | 3DGS-MVS-WK | 24.47 | 0.887 | 0.128 | 532.1 | 50k |
| NeRF-Synthetic | PocketGS | 24.32 | 0.858 | 0.144 | 101.4 | 47k |
| MobileScan | 3DGS-SFM-WK | 21.16 | 0.687 | 0.398 | 112.8 | 23k |
| MobileScan | 3DGS-MVS-WK | 20.85 | 0.781 | 0.281 | 534.5 | 165k |
| MobileScan | PocketGS | 23.67 | 0.791 | 0.225 | 255.2 | 168k |

PocketGS satisfies memory constraints on MobileScan (geometry prior peak: 1.19–2.22 GB, full training peak: 1.82–2.65 GB, all below the 3 GB threshold).

7. Ablation Studies and Operator Contribution

Ablations on MobileScan validate the necessity of every PocketGS operator:

| Variant | PSNR↑ | SSIM↑ | LPIPS↓ | Time (s) |
|---|---|---|---|---|
| Full PocketGS | 23.67 | 0.791 | 0.225 | 255.2 |
| w/o ℐ (anisotropic init) | 22.49 | 0.770 | 0.253 | 319.5 |
| w/o global BA | 23.45 | 0.752 | 0.232 | 251.1 |
| w/o MVS | 21.07 | 0.646 | 0.414 | 124.8 |

Key findings:

  • Removing operator ℐ (anisotropic initialization) forces isotropic seeds, resulting in a 1.2 dB PSNR decrease and 25% increased runtime.
  • Excluding global BA significantly degrades structural similarity (SSIM: 0.791 to 0.752).
  • Omitting MVS severely degrades reconstruction quality (PSNR: 23.67 to 21.07, LPIPS: 0.225 to 0.414).

This demonstrates that 𝒢's lightweight MVS prior sets the ceiling on scene fidelity, ℐ accelerates convergence, and 𝒯 enables stable on-device differentiable training under memory and runtime constraints.


Through precise co-design of geometry prior construction, prior-conditioned initialization, and hardware-aligned optimization, PocketGS establishes a new paradigm for high-fidelity, efficient, fully on-device 3DGS training and rendering under stringent mobile hardware constraints (Guo et al., 24 Jan 2026).
