
PocketGS: On-Device 3D Gaussian Splatting

Updated 27 January 2026
  • PocketGS is a mobile 3D Gaussian Splatting system that delivers high-fidelity, real-time scene modeling through resource-efficient, fully on-device training.
  • It co-designs three operators to meet stringent mobile constraints: 𝒢 for geometry-prior construction, ℐ for anisotropic Gaussian initialization, and 𝒯 for hardware-aligned differentiable optimization.
  • Empirical evaluations show that PocketGS matches or surpasses workstation 3DGS pipelines in perceptual quality under an equal iteration budget while keeping peak memory below 3 GB.

PocketGS is a fully on-device 3D Gaussian Splatting (3DGS) training system designed to enable high-fidelity, efficient 3D scene modeling directly on resource-constrained mobile devices such as smartphones. By jointly addressing the stringent requirements of minute-scale training budgets, strict peak-memory caps (below 3 GB), and hardware-accelerated differentiable optimization, PocketGS delivers real-time novel-view synthesis that matches or surpasses workstation-grade 3DGS pipelines in perceptual fidelity. The paradigm is underpinned by three co-designed operators: 𝒢 for geometry-prior construction, ℐ for prior-conditioned anisotropic Gaussian initialization, and 𝒯 for hardware-aligned differentiable optimization. This design enables capture-to-rendering workflows entirely on-device, as substantiated by empirical comparisons and ablation studies (Guo et al., 24 Jan 2026).

1. On-Device 3DGS: Motivations and Constraints

PocketGS directly targets the limitations of executing 3DGS training on mobile devices, which differ markedly from desktop environments where memory, compute, and time budgets are largely unconstrained. On devices such as the iPhone 15, PocketGS achieves:

  • End-to-end training within 5 minutes (500 iterations ≈ 4 minutes on Apple A16).
  • Peak memory usage under 3 GB, covering both geometry prior formation and parameter optimization.
  • Correct backpropagation of gradients solely through GPU-side operations, precluding expensive CPU-GPU synchronizations.

Naive application of desktop 3DGS fails under these constraints, confronting three primary contradictions:

  • Input-Recovery Contradiction: Mobile RGB-D scans yield noisy, sparse geometric inputs; naively densifying these within the training loop inflates computational and memory demands.
  • Initialization-Convergence Contradiction: Isotropic Gaussian seeding requires excessive iterations to organize primitives onto scene surfaces—unacceptable when runtime is tightly bounded.
  • Hardware-Differentiability Contradiction: Tile-based deferred rendering on mobile GPUs obscures blending state, thus hindering correct and efficient backpropagation without prohibitive overhead.

PocketGS resolves these issues through the co-design of 𝒢, ℐ, and 𝒯, aligning algorithmic and hardware constraints.

2. Operator 𝒢: Geometry-Prior Construction

Operator 𝒢 synthesizes a geometry-faithful, memory-efficient dense point-cloud prior $\mathcal{P}$ to initialize the scene, comprising the following subsystems:

2.1 Information-Gated Frame Subsampling

To reduce processing load without sacrificing geometric diversity, PocketGS selects keyframes by:

  • Displacement gate: admitting frames whose translation change satisfies $d = \|t_{\text{curr}} - t_{\text{last}}\|_2 \geq \tau_d$, with $\tau_d = 5$ cm.
  • Sharpness gate: using the approximate gradient energy $S = \frac{1}{|\Omega|} \sum_{(x,y) \in \Omega} \left(|I(x+\Delta, y) - I(x, y)| + |I(x, y+\Delta) - I(x, y)|\right)$.
  • Windowing: within every 8-frame window, a candidate frame replaces the current best only if $S_{\text{new}} > (1 + r) S_{\text{best}}$, with $r = 0.05$.

This gating strictly bounds both the bundle adjustment (BA) and multi-view stereo (MVS) workloads, admitting only 8–15 keyframes for typical captures.
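As an illustration, the two gates and the windowing rule can be sketched in Python. The thresholds $\tau_d = 5$ cm and $r = 0.05$ come from the paper; the grayscale-image representation and the window bookkeeping are simplifying assumptions.

```python
import math

TAU_D = 0.05     # displacement gate threshold: 5 cm (from the paper)
R_MARGIN = 0.05  # sharpness replacement margin r (from the paper)
WINDOW = 8       # frames per window (from the paper)

def sharpness(img, delta=1):
    """Approximate gradient energy S over a grayscale 2-D list."""
    h, w = len(img), len(img[0])
    total, count = 0.0, 0
    for y in range(h - delta):
        for x in range(w - delta):
            total += abs(img[y][x + delta] - img[y][x]) + abs(img[y + delta][x] - img[y][x])
            count += 1
    return total / count if count else 0.0

def select_keyframes(frames):
    """frames: list of (position (x, y, z), image). Returns indices of admitted keyframes."""
    keep = []
    last_pos = None        # position of the last admitted keyframe
    best_in_window = None  # (index, sharpness) of current window's best candidate
    for i, (pos, img) in enumerate(frames):
        # Displacement gate: skip frames that moved less than TAU_D since the last keyframe.
        if last_pos is not None and math.dist(pos, last_pos) < TAU_D:
            continue
        s = sharpness(img)
        # Windowing: replace the current best only on a clear (1 + r) sharpness gain.
        if best_in_window is None or s > (1 + R_MARGIN) * best_in_window[1]:
            best_in_window = (i, s)
        if (i + 1) % WINDOW == 0 and best_in_window is not None:
            keep.append(best_in_window[0])
            last_pos = frames[best_in_window[0]][0]
            best_in_window = None
    return keep
```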

2.2 GPU-Native Global Bundle Adjustment

Refinement of the ARKit poses $\{\hat{T}_t\}$ and sparse points $\{P_j\}$ employs a robust, fully GPU-based Schur-complement solver to minimize the Huber-robustified reprojection loss:

$\min_{\{T_i\}, \{P_j\}} \sum_{i, j} \rho\left(\|\pi(T_i, P_j) - p_{ij}\|^2_{\Sigma_{ij}}\right)$

Here, $\pi$ denotes the camera projection and $\rho$ the Huber loss. The Hessian blocks are partitioned and inverted in parallel, avoiding CPU-GPU round-trips and yielding a clean, high-precision sparse point cloud.
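The robustifier $\rho$ can be sketched as the standard Huber loss applied to a squared residual $s = \|r\|^2$, following the convention used by solvers such as Ceres; the threshold `delta` below is an illustrative default, not a value from the paper.

```python
import math

def huber(s, delta=1.0):
    """Huber robustifier rho(s) on a squared residual s = ||r||^2:
    identity (quadratic in r) below delta, linear growth in r beyond it."""
    r = math.sqrt(s)
    if r <= delta:
        return s
    return 2.0 * delta * r - delta * delta

def huber_weight(s, delta=1.0):
    """IRLS weight rho'(s): 1 for inliers, down-weighted past the threshold.
    Large reprojection errors thus contribute bounded gradient magnitude."""
    r = math.sqrt(s)
    return 1.0 if r <= delta else delta / r
```

In an iteratively reweighted least-squares view, each residual block in the BA normal equations is simply scaled by `huber_weight` before the Schur elimination.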

2.3 Single-Reference Cost-Volume MVS

Dense reconstruction relies on a single-reference plane-sweep MVS, selecting the optimal reference frame to maximize:

$S_{\mathrm{ref}} = \exp\left(-\frac{(b - b_\mathrm{target})^2}{2\sigma_b^2}\right) \max\left(\frac{\alpha}{\alpha_\min}, 1\right)$

where $b$ is the baseline and $\alpha$ the viewing angle; depths are sampled between the $5\%$ and $95\%$ quantiles of the existing sparse depths. Census-transform matching costs with Semi-Global Matching produce depth maps, which are fused into $\mathcal{P}$ wherever per-pixel confidence exceeds $0.4$.
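A minimal sketch of the reference-frame score; the parameter values `b_target`, `sigma_b`, and `alpha_min` are hypothetical placeholders, as the paper does not state them.

```python
import math

def ref_score(b, alpha, b_target=0.15, sigma_b=0.05, alpha_min=5.0):
    """Reference-frame score: a Gaussian preference around the target baseline,
    scaled by a viewing-angle factor that never drops below 1.
    b_target, sigma_b, alpha_min are illustrative values, not from the paper."""
    baseline_term = math.exp(-((b - b_target) ** 2) / (2.0 * sigma_b ** 2))
    angle_term = max(alpha / alpha_min, 1.0)
    return baseline_term * angle_term
```

The frame maximizing this score becomes the single plane-sweep reference, so MVS cost-volume memory stays constant regardless of how many keyframes were admitted.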

3. Operator ℐ: Prior-Conditioned Gaussian Initialization

Operator ℐ addresses initialization-convergence inefficiency by directly embedding local surface statistics into each Gaussian's parameterization:

3.1 Local Covariance and Normal Estimation

For each point $p_i$ in $\mathcal{P}$, the covariance $C_i$ is computed from its $K = 16$ nearest neighbors:

$C_i = \frac{1}{K}\sum_{k} (p_k - \bar{p}_i)(p_k - \bar{p}_i)^T$

The eigenvector associated with the smallest eigenvalue of $C_i$ yields the surface normal $n_i$ at $p_i$.
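The neighborhood covariance and its smallest-eigenvalue eigenvector can be computed without a linear-algebra library; the shifted power iteration below is an illustrative substitute for the closed-form 3×3 eigendecomposition a real implementation would use.

```python
import math

def covariance(points):
    """3x3 covariance of a list of 3-D points (the K-NN neighborhood of p_i)."""
    k = len(points)
    mean = [sum(p[j] for p in points) / k for j in range(3)]
    c = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[j] - mean[j] for j in range(3)]
        for a in range(3):
            for b in range(3):
                c[a][b] += d[a] * d[b] / k
    return c

def smallest_eigvec(c, iters=200):
    """Eigenvector of the smallest eigenvalue of symmetric 3x3 c, via power
    iteration on (tr(c) * I - c); the shift reverses the eigenvalue order,
    so the dominant direction of the shifted matrix is the surface normal."""
    t = c[0][0] + c[1][1] + c[2][2]
    m = [[(t if a == b else 0.0) - c[a][b] for b in range(3)] for a in range(3)]
    v = [1.0, 1.0, 1.0]
    for _ in range(iters):
        w = [sum(m[a][b] * v[b] for b in range(3)) for a in range(3)]
        n = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / n for x in w]
    return v
```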

3.2 Disc-Like Covariance Seeding

Tangential and normal scales are calculated as:

$s_t = \frac{1}{3}\sum_{k=1}^3 \|p_i - p_k\|, \qquad s_n = r_{\text{normal}} \cdot s_t, \qquad r_{\text{normal}} = 0.3$

The Gaussian covariance is parameterized as:

$\Sigma_i = Q_i\,\mathrm{diag}(s_t^2,\,s_t^2,\,s_n^2)\,Q_i^T$

where $Q_i$ is the rotation aligning the local $z$-axis to $n_i$. Opacity logits are initialized to $\mathrm{logit}(0.1)$. All scales are optimized in log space for numerical stability. This anisotropic, surface-aligned seeding dramatically reduces convergence time and conditions the model for rapid high-fidelity reconstruction.
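A sketch of the disc-like seeding for a single point, with scales returned in log space as described; the helper names are hypothetical.

```python
import math

R_NORMAL = 0.3  # normal-to-tangential scale ratio r_normal (from the paper)

def disc_scales(p, three_nn):
    """Tangential scale = mean distance to the 3 nearest neighbors;
    normal scale = R_NORMAL * tangential scale. Returned as log-scales,
    since the scales are optimized in log space."""
    s_t = sum(math.dist(p, q) for q in three_nn) / 3.0
    s_n = R_NORMAL * s_t
    return math.log(s_t), math.log(s_n)

def opacity_logit(alpha0=0.1):
    """Initial opacity parameter logit(0.1)."""
    return math.log(alpha0 / (1.0 - alpha0))
```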

4. Operator 𝒯: Hardware-Aligned Differentiable Splatting

Operator 𝒯 enables correct and efficient differentiable rendering on tile-based mobile GPUs:

4.1 Unrolled Alpha-Compositing with Forward Replay Cache

Manually unrolling the front-to-back alpha compositing equation,

$C_{\text{out}} = C_{\text{in}} (1 - \alpha) + \alpha c$

PocketGS stores a minimal per-pixel replay cache $S = \{C_{\text{in}}, \alpha\}$ and an $O(WH)$ counter buffer, permitting correct gradient computation:

$\frac{\partial L}{\partial C_{\text{in}}} = \frac{\partial L}{\partial C_{\text{out}}} \cdot (1 - \alpha), \qquad \frac{\partial L}{\partial \alpha} = \frac{\partial L}{\partial C_{\text{out}}} \cdot (c - C_{\text{in}})$

This obviates the need for full splat lists or framebuffer readbacks.
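The unrolled compositing and its replay-cache backward pass can be sketched for a single pixel with scalar color; the per-tile bookkeeping of the on-device implementation is omitted.

```python
def composite_forward(splats, c_bg=0.0):
    """Blend splats [(c, alpha), ...] with C_out = C_in * (1 - alpha) + alpha * c,
    caching (C_in, alpha) at each step for the backward pass."""
    cache = []
    c_acc = c_bg
    for c, a in splats:
        cache.append((c_acc, a))
        c_acc = c_acc * (1.0 - a) + a * c
    return c_acc, cache

def composite_backward(dL_dCout, splats, cache):
    """Replay the cache in reverse blend order; returns per-splat (dL/dc, dL/dalpha)."""
    grads = [None] * len(splats)
    g = dL_dCout
    for i in reversed(range(len(splats))):
        c_in, a = cache[i]
        c, _ = splats[i]
        grads[i] = (g * a, g * (c - c_in))  # dL/dc and dL/dalpha for splat i
        g = g * (1.0 - a)                   # dL/dC_in flows to earlier blend steps
    return grads
```

Note that the backward pass needs only the cached `(C_in, alpha)` pairs and the incoming gradient, matching the claim that no full splat lists or framebuffer readbacks are required.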

4.2 Index-Mapped Gradient Scattering

Rendering requires the parameter vectors $\theta$ to be arranged by depth. Gaussians are sorted on-GPU, producing an index mapping $\pi$. Parameters are gathered in depth order during the forward pass, and gradients are scattered back through $\pi$ in the backward pass:

$\nabla\theta_{\pi(i)} \mathrel{+}= g_i$

Index mapping preserves optimizer state alignment and allows seamless backpropagation without CPU intervention.
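A minimal CPU sketch of the gather/scatter bookkeeping; on-device this runs as GPU kernels over the sorted index buffer.

```python
def sort_by_depth(depths):
    """Depth sort producing the index mapping pi:
    pi[i] is the original index of the i-th nearest Gaussian."""
    return sorted(range(len(depths)), key=lambda i: depths[i])

def gather(params, pi):
    """Arrange parameters in depth order for the forward pass."""
    return [params[j] for j in pi]

def scatter_grads(pi, sorted_grads, n):
    """Scatter depth-ordered gradients g_i back to original slots pi[i],
    so gradients line up with the (unsorted) optimizer state."""
    grads = [0.0] * n
    for i, g in enumerate(sorted_grads):
        grads[pi[i]] += g
    return grads
```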

4.3 On-GPU Adam Updates

All Adam optimizer moments and parameter updates execute within a single GPU kernel, eliminating host-device synchronization costs. For numerical stability, opacity is updated in logit space, scales in log space, and rotations in tangent space, with arithmetic carried out in FP16.
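The fused update can be illustrated with a plain reference implementation of one Adam step per parameter; the hyperparameter defaults are the standard Adam values, not stated by the paper.

```python
import math

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a flat parameter list; on-device, this whole loop
    is a single GPU kernel touching each parameter and its two moments once.
    t is the 1-based step count used for bias correction."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, grad, m, v):
        mi = b1 * mi + (1 - b1) * g          # first-moment EMA
        vi = b2 * vi + (1 - b2) * g * g      # second-moment EMA
        m_hat = mi / (1 - b1 ** t)           # bias-corrected moments
        v_hat = vi / (1 - b2 ** t)
        new_theta.append(th - lr * m_hat / (math.sqrt(v_hat) + eps))
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v
```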

5. End-to-End Mobile Capture-to-Rendering Workflow

The PocketGS workflow comprises:

  1. Capture ~50 seconds of video with ARKit pose tracking.
  2. Information-gated keyframe selection (typically 8–15 frames).
  3. GPU-native global BA to refine camera trajectories and the initial sparse points.
  4. Single-reference MVS to produce a dense point cloud $\mathcal{P}$.
  5. ℐ-based initialization to seed surface-aligned, anisotropic Gaussians ($\Theta_0$).
  6. Adam-optimized differentiable splatting (𝒯) for 500 iterations on-device.
  7. Real-time rendering of the final Gaussian scene on the mobile device.

A companion mobile application implements the pipeline in Swift+Metal. All operations, including BA, MVS, initialization, and training, are fully on-device.

6. Experimental Results and Comparative Analysis

PocketGS is empirically benchmarked against two workstation 3DGS pipelines:

  • 3DGS-SFM-WK: COLMAP SfM sparse prior + vanilla 3DGS.
  • 3DGS-MVS-WK: COLMAP dense prior + vanilla 3DGS.

All systems are restricted to a 500-iteration budget and equivalent resolution. The following table summarizes performance on LLFF, NeRF-Synthetic, and MobileScan datasets (Guo et al., 24 Jan 2026):

| Dataset | Method | PSNR↑ | SSIM↑ | LPIPS↓ | Time (s) | #Gaussians |
|---|---|---|---|---|---|---|
| LLFF | 3DGS-SFM-WK | 21.01 | 0.641 | 0.405 | 108.0 | 18k |
| LLFF | 3DGS-MVS-WK | 19.53 | 0.637 | 0.387 | 313.1 | 40k |
| LLFF | PocketGS | 23.54 | 0.791 | 0.222 | 105.4 | 33k |
| NeRF-Synthetic | 3DGS-SFM-WK | 21.75 | 0.800 | 0.243 | 83.7 | 12k |
| NeRF-Synthetic | 3DGS-MVS-WK | 24.47 | 0.887 | 0.128 | 532.1 | 50k |
| NeRF-Synthetic | PocketGS | 24.32 | 0.858 | 0.144 | 101.4 | 47k |
| MobileScan | 3DGS-SFM-WK | 21.16 | 0.687 | 0.398 | 112.8 | 23k |
| MobileScan | 3DGS-MVS-WK | 20.85 | 0.781 | 0.281 | 534.5 | 165k |
| MobileScan | PocketGS | 23.67 | 0.791 | 0.225 | 255.2 | 168k |

PocketGS satisfies memory constraints on MobileScan (geometry prior peak: 1.19–2.22 GB, full training peak: 1.82–2.65 GB, all below the 3 GB threshold).

7. Ablation Studies and Operator Contribution

Ablations on MobileScan validate the necessity of every PocketGS operator:

| Variant | PSNR↑ | SSIM↑ | LPIPS↓ | Time (s) |
|---|---|---|---|---|
| Full PocketGS | 23.67 | 0.791 | 0.225 | 255.2 |
| w/o ℐ (anisotropic init) | 22.49 | 0.770 | 0.253 | 319.5 |
| w/o global BA | 23.45 | 0.752 | 0.232 | 251.1 |
| w/o MVS | 21.07 | 0.646 | 0.414 | 124.8 |

Key findings:

  • Removing operator ℐ (anisotropic initialization) forces isotropic seeds, resulting in a 1.2 dB PSNR decrease and 25% increased runtime.
  • Excluding global BA significantly degrades structural similarity (SSIM: 0.791 to 0.752).
  • Omitting MVS severely degrades reconstruction quality (PSNR: 23.67 to 21.07, LPIPS: 0.225 to 0.414).

This demonstrates that 𝒢's lightweight MVS prior sets the ceiling on scene fidelity, ℐ accelerates convergence, and 𝒯 enables stable on-device differentiable training under memory and runtime constraints.


Through precise co-design of geometry prior construction, prior-conditioned initialization, and hardware-aligned optimization, PocketGS establishes a new paradigm for high-fidelity, efficient, fully on-device 3DGS training and rendering under stringent mobile hardware constraints (Guo et al., 24 Jan 2026).
