Feed-Forward 3DGS Compression
- Feed-forward 3DGS compression frameworks are rapid, optimization-free methods for compact 3D scene representation using neural transforms and adaptive quantization.
- They leverage long-context modeling via Morton serialization and attention mechanisms to preserve spatial locality and boost entropy coding efficiency.
- Empirical results show ~20× compression with minimal quality loss, achieving superior rate–distortion performance and rendering fidelity compared to traditional methods.
Feed-Forward 3DGS (3D Gaussian Splatting) compression frameworks are a class of algorithms enabling fast, optimization-free compression of large-scale 3DGS scene representations. These approaches achieve high compression ratios with sublinear compute cost via neural or analytic transform coding, adaptive quantization, advanced entropy models, and context modeling. The following sections detail the methodologies, core modules, and empirical performance of state-of-the-art feed-forward 3DGS compression frameworks, with a focus on recent advances in long-context modeling exemplified by LocoMoco (Liu et al., 30 Nov 2025), as well as comparisons to alternative paradigms (Song et al., 11 Jun 2025, Liu et al., 2024, Chen et al., 2024).
1. Motivation and Scope
3D Gaussian Splatting (3DGS) has emerged as a powerful representation for novel-view synthesis and real-time 3D reconstruction. However, the large scale and redundancy of typical 3DGS models—often comprising hundreds of thousands or millions of Gaussians, each with dozens of attributes—constitute a major barrier to their widespread transmission, sharing, and storage. Traditional compression methods frequently rely on per-scene optimization, leading to high computational cost and scene-specific artifacts. Feed-forward 3DGS compression frameworks address this by enabling rapid, generalizable compression through a single pass over the data, with no per-scene learning or iterative optimization (Liu et al., 30 Nov 2025, Chen et al., 2024, Song et al., 11 Jun 2025).
2. Core Framework Components
The canonical feed-forward 3DGS compression pipeline consists of the following stages:
- Representation structuring: Transforming unordered sets of Gaussians into structured sequences or blocks amenable to neural processing and context modeling.
- Transform coding: Applying neural attention blocks, analysis transforms, or domain-specific encodings to decorrelate and compactly encode Gaussian attributes.
- Quantization: Discretizing the continuous-valued attributes (position, color, scale, orientation, SH coefficients) via uniform or learned quantizers.
- Entropy coding: Assigning bitrates based on probabilistic models of symbol likelihood conditioned on side information, context, or hyperpriors.
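The quantization stage above can be illustrated with a minimal uniform scalar quantizer; the step size used here is hypothetical, not a value from the paper:

```python
def uniform_quantize(x: float, step: float) -> int:
    """Map a continuous attribute value to an integer symbol (uniform quantizer)."""
    return round(x / step)

def dequantize(q: int, step: float) -> float:
    """Reconstruct an approximate attribute value from its symbol."""
    return q * step

# Example: quantize a scale attribute with a (hypothetical) step of 0.01
q = uniform_quantize(0.2345, 0.01)   # -> 23
x_hat = dequantize(q, 0.01)          # approximately 0.23
```

Learned quantizers replace the fixed step with network-predicted, per-attribute scales, but the encode/decode symmetry is the same.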
In LocoMoco (Liu et al., 30 Nov 2025), the cornerstone is the use of large context windows derived from Morton-order (Z-order curve) serialization of 3D positions, which ensures that spatially adjacent Gaussians remain close in sequence for context-aware transforms and entropy coding.
3. Long-Context Modeling via Morton Serialization and Attention
Long-range dependency modeling is realized through the following design choices:

Morton Serialization
- Quantize each Gaussian center $(x, y, z)$ to $b$-bit integers.
- Compute the Morton index $m$ by interleaving the bits of $x$, $y$, $z$: $m = \mathrm{interleave}(x, y, z)$.
- Sort Gaussians by increasing $m$, creating a 1D sequence that preserves spatial locality.
- Partition the sequence into windows of length $W$ (typically $W = 1024$), enabling the modeling of context over thousands of neighboring Gaussians.
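The serialization steps above can be sketched in Python. `part1by2` and `morton3` are illustrative names for the standard bit-interleaving helpers, assuming 10-bit quantized coordinates:

```python
def part1by2(v: int) -> int:
    """Spread the bits of a 10-bit integer so they occupy every third bit."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3(x: int, y: int, z: int) -> int:
    """Interleave the bits of quantized coordinates x, y, z into a Morton index."""
    return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2)

# Sort a toy set of quantized Gaussian centers along the Z-order curve
centers = [(3, 1, 0), (0, 0, 0), (1, 1, 1)]
order = sorted(centers, key=lambda c: morton3(*c))
```

Centers that are close in 3D tend to receive nearby Morton indices, which is what keeps spatial neighbors adjacent in the serialized sequence.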
Attention-Based Transform Coding
- Each context window is encoded via a positional encoding built on a 3-layer DGCNN, capturing local geometry.
- Standard multi-head self-attention (QKV) is performed over all embeddings, allowing each Gaussian to aggregate information from both adjacent and distant spatial neighbors within its window:
  $\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
- Downstream, the latent vectors resulting from attention are employed as hyperpriors for entropy modeling.
This architecture enables the effective capture of both local and long-range correlations, which standard local voxel-grid approaches fail to model.
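A minimal sketch of scaled dot-product self-attention over one context window, assuming for simplicity a single head with Q = K = V; the actual model uses learned projections, multiple heads, and DGCNN-based positional encodings:

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(X):
    """Single-head scaled dot-product self-attention over one context window.

    X is a list of d-dimensional embeddings; every Gaussian attends to every
    other Gaussian in the same window, near or far along the Morton sequence.
    """
    d = len(X[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in X]
              for q in X]
    weights = [softmax(row) for row in scores]
    # Each output embedding is a convex combination of all window embeddings
    return [[sum(w * v[j] for w, v in zip(row, X)) for j in range(d)]
            for row in weights]

Y = self_attention([[1.0, 0.0], [0.0, 1.0]])
```

Because the attention weights span the whole window, correlations between Gaussians thousands of positions apart can still influence the latent representation.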
4. Fine-Grained Auto-Regressive Entropy Modeling
LocoMoco employs a novel entropy model that jointly factorizes the codebook symbols per context window along both spatial and channel axes:
- Space-channel factorization:
- Partition the symbol sequence into anchor (even) and non-anchor (odd) spatial indices, as well as channel groups.
- Conditional coding:
- For each subgroup, symbol probabilities are conditioned on:
- Previously decoded channels (channel context)
- Previously decoded anchors (spatial context, for non-anchor symbols)
- Latent hyperpriors
The conditional probability structure is:
$p(\hat{y}) = \prod_{i} p\left(\hat{y}_i \mid \psi(c_i), z\right)$
where $\psi(c_i)$ incorporates context from already-coded channel and space elements, and $z$ is the latent prior.
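The space-channel coding order described above can be sketched as follows; the function name is illustrative and the even/odd anchor convention follows this section's description, while the exact grouping in the paper may differ:

```python
def coding_schedule(num_positions: int, num_channel_groups: int):
    """Illustrative decode order for space-channel factorization.

    Channel groups are coded sequentially; within each group, even ('anchor')
    spatial positions are coded before odd ('non-anchor') ones, so non-anchor
    symbols can condition on already-decoded anchors and earlier channels.
    """
    anchors = [i for i in range(num_positions) if i % 2 == 0]
    non_anchors = [i for i in range(num_positions) if i % 2 == 1]
    schedule = []
    for g in range(num_channel_groups):
        schedule += [(g, i, "anchor") for i in anchors]
        schedule += [(g, i, "non-anchor") for i in non_anchors]
    return schedule

sched = coding_schedule(4, 2)
```

By the time a non-anchor symbol is decoded, both of its anchor neighbors and all earlier channel groups are available as conditioning context.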
- Rate–Distortion Optimization:
- The total rate is the sum of negative log-likelihoods for all coded symbols and hyperpriors.
- The overall loss function is:
  $\mathcal{L} = R + \lambda D$
Here $D$ is a weighted combination of MSE and SSIM computed over the rendered images.
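The rate term and the overall rate–distortion objective can be sketched numerically; the symbol probabilities and trade-off weight here are hypothetical:

```python
import math

def rate_bits(probs):
    """Total rate as the sum of negative log-likelihoods (in bits) of coded symbols."""
    return sum(-math.log2(p) for p in probs)

def rd_loss(probs, distortion: float, lam: float) -> float:
    """Rate-distortion objective L = R + lambda * D.

    In the paper D combines MSE and SSIM over renderings; here it is a
    placeholder scalar, and lam is a hypothetical trade-off weight.
    """
    return rate_bits(probs) + lam * distortion

# Two symbols with model probabilities 0.5 and 0.25 cost 1 + 2 = 3 bits
loss = rd_loss([0.5, 0.25], distortion=2.0, lam=0.5)  # -> 4.0
```

Sharper conditional probabilities (closer to 1 for the actual symbols) directly lower the rate term, which is why richer context modeling translates into bitrate savings.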
Three-stage training is performed: proxy pretraining, staged optimization of components, and end-to-end fine-tuning over the rate–distortion objective.
5. Empirical Results and Comparative Performance
Key findings on the DL3DV-GS, Mip-NeRF 360, and Tanks & Temples benchmarks (Liu et al., 30 Nov 2025):
- Compression ratio: Achieves ~20× reduction in raw 3DGS size.
- Quality preservation: PSNR loss ≤ 0.5 dB; superior rate–distortion trade-off compared to prior feed-forward methods, especially FCGS (Chen et al., 2024).
- Bitrate savings: BD-Rate savings of –10.1% (DL3DV-GS), –9.4% (Mip-NeRF 360), and –10.4% (Tanks & Temples) relative to FCGS at the same visual fidelity.
- Qualitative outcomes: LocoMoco retains sharp and color-faithful renderings at high compression rates, whereas FCGS tends to introduce blur or color-shift artifacts.
- Ablations: Shrinking the context window, or removing channel/spatial context or the DGCNN-based attention, leads to substantial (10–30%) BD-Rate degradation, underscoring the necessity of long-range and contextual modeling.
Table: Rate–Distortion Comparison (DL3DV-GS, Mip-NeRF 360, Tanks & Temples)
| Method | Compression Ratio | BD-Rate Savings | Visual Quality Impact |
|---|---|---|---|
| LocoMoco | ~20× | –10% vs. FCGS | High-fidelity |
| FCGS | ~20× | Baseline | Blur, color shift |
| Uncompressed | 1× | – | Reference |
6. Implementation and Practical Considerations
- Training data: DL3DV-GS (6,770 scenes), with cross-benchmark evaluations.
- Window length: Default $W = 1024$, with ablations showing a performance drop for shorter windows.
- Feed-forward inference: One forward pass suffices; test-time pipeline is quantization, Morton serialization, window partitioning, and one-pass attention + entropy coding/decoding.
- Resource requirements: Encoding/decoding per scene on GPU is ~11–13 s, with peak memory ~45 GB.
- Hybrid lossless/lossy coding: The division strategy in the entropy model yields a ~1 dB PSNR gap between the all-lossless and all-lossy color paths; the hybrid mode gives the best trade-off.
7. Limitations, Future Directions, and Broader Context
- Scalability: For extremely large scenes, batching or hierarchical windowing may be needed.
- Dynamic content: Extension to 4D (temporal) compression requires temporal context modeling.
- Efficiency: High GPU memory/compute footprints suggest future work exploring efficient attention architectures (e.g. Linformer, Performer) or quantized networks.
- Streaming: Integration with streaming arithmetic coding is proposed.
Feed-forward 3DGS compression with long-context modeling delineates a new state-of-the-art in generalizable, rapid, and high-fidelity 3D scene compression. These frameworks converge toward the rate–distortion efficiency of optimization-based pipelines but with orders-of-magnitude speedup and broad applicability (Liu et al., 30 Nov 2025, Chen et al., 2024, Song et al., 11 Jun 2025, Liu et al., 2024).