G-Mamba Diffusion Encoder

Updated 29 March 2026

G-Mamba Diffusion Encoder is a family of denoising encoders that employ bidirectional Mamba State Space Models for efficient conditional and unconditional generation.
Its architecture replaces quadratic self-attention with linear-time 1D SSM convolutions, enabling high-resolution modeling and reduced computational cost.
Variants extend the model to 3D voxels, image spatial-frequency fusion, and discrete masked diffusion for language, showcasing broad applicability across domains.

The G-Mamba Diffusion Encoder is a family of diffusion denoising encoders built on the Mamba State Space Model (SSM) architecture, designed for fast, scalable, and high-fidelity conditional and unconditional generation across domains such as 3D voxelized point clouds, images, and text. G-Mamba encoders replace quadratic-cost self-attention with linear-time bidirectional SSMs, and some variants add global attention, frequency-domain reasoning, and adaptive conditioning. The cited works report improved generative modeling efficiency without loss of sample quality (Mo, 2024, Phung et al., 2024, Singh et al., 19 Nov 2025).

1. Architectural Principles and Dataflow

The G-Mamba Diffusion Encoder employs a stack of bidirectional Mamba SSM blocks organized in a transformer-like fashion, but replaces multi-head attention with linear-complexity 1D SSM convolutions. Input features (voxelized point clouds $x \in \mathbb{R}^{X \times Y \times Z \times C}$, images in latent space, or token embeddings) are patchified and embedded. A class or condition token is prepended to the sequence, which is then processed by $K$ stacked G-Mamba (DiM) blocks.

Each block includes:

- LayerNorm and linear projection on tokens.
- Bidirectional SSM scan: forward and backward 1D convolutions with SSM kernels parameterized per block by the discretized matrices $(\bar A, \bar B)$ and the output matrix $C$, derived from learned continuous SSM parameters through discretization.
- Output fusion: the forward output and the re-reversed backward output are added.
- Skip connection and MLP (e.g., with a GEGLU or GELU nonlinearity).

The resulting sequence is projected back to the original or latent space dimensions. This architectural strategy enables $O(LD)$ compute and memory per block, as opposed to $O(L^2 D)$ in transformer-based encoders with full self-attention (Mo, 2024).

2. State-Space Model Formulation

Underlying each G-Mamba block is the Mamba SSM, formally written (in continuous time) as:

$\frac{d}{dt} h(t) = A h(t) + B x(t), \qquad y(t) = C h(t).$

Discretization yields:

$h_t = \bar{A} h_{t-1} + \bar{B} x_t, \qquad y_t = C h_t,$

where $\bar{A} = \exp(\Delta A)$, $\bar{B} = A^{-1}(\exp(\Delta A) - I) B$, and the step size $\Delta$ can be made input-dependent (Singh et al., 19 Nov 2025).
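As a minimal sketch of the discretization above, the following assumes a single scalar state channel (the diagonal case), so that $A^{-1}(\exp(\Delta A) - I) B$ reduces to `expm1(delta * a) / a * b`. The function names and the numeric values are illustrative assumptions, not taken from the cited papers, and `a` must be nonzero.

```python
import math


def discretize_zoh(a: float, b: float, delta: float) -> tuple[float, float]:
    """Zero-order-hold discretization of one scalar SSM channel (a != 0)."""
    a_bar = math.exp(delta * a)
    # Scalar form of A^{-1}(exp(delta*A) - I) B; expm1 stays accurate for small delta*a.
    b_bar = math.expm1(delta * a) / a * b
    return a_bar, b_bar


def ssm_scan(xs, a, b, c, delta):
    """Run h_t = a_bar*h_{t-1} + b_bar*x_t and y_t = c*h_t over a sequence."""
    a_bar, b_bar = discretize_zoh(a, b, delta)
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


# Hypothetical values: a stable channel (a < 0) driven by a unit impulse.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=-1.0, b=1.0, c=1.0, delta=0.5)
```

With an impulse input, each output after the first is the previous one multiplied by `a_bar`, which is the geometric decay the recurrence predicts. For small `delta`, `b_bar` approaches `delta * b`, the familiar first-order approximation.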

Bidirectional sequence scanning is implemented by applying forward-convolution kernels to the input and mirrored backward-convolution kernels to the reversed sequence, followed by fusion. This bidirectionality recovers global mixing absent from uni-directional SSMs. All convolutions are parallelizable and scale linearly with sequence length.
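The scan-reverse-fuse pattern described above can be sketched as follows. This is a simplified illustration with one scalar channel per direction and already-discretized parameters; `causal_scan`, `bidirectional_scan`, and the parameter values are hypothetical names and numbers, not the papers' implementation.

```python
def causal_scan(xs, a_bar, b_bar, c):
    """Left-to-right recurrence: position t only sees inputs at positions <= t."""
    h, ys = 0.0, []
    for x in xs:
        h = a_bar * h + b_bar * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, fwd, bwd):
    """fwd and bwd are (a_bar, b_bar, c) tuples for the two directions."""
    y_fwd = causal_scan(xs, *fwd)
    # Scan the reversed input, then restore the original order before fusing.
    y_bwd = causal_scan(xs[::-1], *bwd)[::-1]
    return [f + b for f, b in zip(y_fwd, y_bwd)]


params = (0.5, 1.0, 1.0)  # hypothetical (a_bar, b_bar, c)
impulse = [0.0, 0.0, 1.0, 0.0, 0.0]
one_way = causal_scan(impulse, *params)
two_way = bidirectional_scan(impulse, params, params)
```

In the one-directional output the positions before the impulse stay at zero, whereas the fused output is nonzero on both sides, which is the sense in which every position receives information from the whole sequence.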

3. Domain-Specific Extensions

3.1 3D Shape and Voxel Generation

For 3D point clouds and voxel grids, G-Mamba encoders (as in DiM-3D) first patchify the voxel tensor into non-overlapping cubic patches, embed these patches, and process the sequence through stacked G-Mamba blocks. A final projection reconstructs predicted noise $\hat\epsilon_\theta(x_t, t)$ in the original voxel grid for diffusion denoising. Experimental evidence shows that this approach allows efficient training and inference for large grids (up to $2048^3$) while outperforming DiT-based transformer baselines on both distribution matching (1-NNA/CD, COV/CD) and completion tasks (Mo, 2024).
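The patchification step can be illustrated with a small sketch. It assumes a single-channel grid stored as nested lists and a raster ordering of patches; both choices, and the name `patchify_3d`, are assumptions made for illustration rather than details confirmed by the source.

```python
def patchify_3d(vox, p):
    """Split an X*Y*Z nested list into non-overlapping p*p*p patches.

    Each patch is flattened into one token (a list of p**3 values).
    """
    X, Y, Z = len(vox), len(vox[0]), len(vox[0][0])
    if X % p or Y % p or Z % p:
        raise ValueError("grid dimensions must be divisible by the patch size")
    tokens = []
    for x0 in range(0, X, p):
        for y0 in range(0, Y, p):
            for z0 in range(0, Z, p):
                tokens.append([vox[x][y][z]
                               for x in range(x0, x0 + p)
                               for y in range(y0, y0 + p)
                               for z in range(z0, z0 + p)])
    return tokens


# Toy 4x4x4 grid whose voxel value encodes its own coordinates.
grid = [[[x * 16 + y * 4 + z for z in range(4)] for y in range(4)] for x in range(4)]
tokens = patchify_3d(grid, 2)
```

The sequence length is $(X/p)(Y/p)(Z/p)$, so halving the patch size multiplies the number of tokens by eight. This is why per-block cost that grows linearly rather than quadratically in sequence length matters for large grids.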

3.2 Visual Domain: Spatial-Frequency Unification

The DiMSUM variant incorporates explicit 2-level Haar wavelet decomposition of the input feature map, yielding a 1-D sequence concatenating all spatial-frequency subbands. Separate “spatial-Mamba” and “wavelet-Mamba” blocks process the image patch and wavelet representations. A “query-swap” cross-attention layer tightly fuses their outputs by reciprocal cross-attention between the spatial and frequency streams, followed by shared-transformer blocks for order-invariant global mixing. This hybridizes the local structure-exploiting bias of SSMs with global frequency reasoning and periodic attention (Phung et al., 2024).
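To make the wavelet sequence concrete, here is a minimal sketch of a 2-level Haar decomposition flattened into one sequence. It uses the averaging (unnormalized) form of the Haar filters, assumes image sides divisible by 4, and picks an arbitrary subband ordering; these choices and the function names are assumptions, and DiMSUM's actual normalization and ordering may differ.

```python
def haar_level(img):
    """One level of a 2D Haar transform using 2x2 averages and differences.

    Returns (approx, horiz, vert, diag), each half the input size per axis.
    """
    approx, horiz, vert, diag = [], [], [], []
    for i in range(0, len(img), 2):
        rows = ([], [], [], [])
        for j in range(0, len(img[0]), 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            rows[0].append((a + b + c + d) / 4)  # local average
            rows[1].append((a - b + c - d) / 4)  # change along a row
            rows[2].append((a + b - c - d) / 4)  # change along a column
            rows[3].append((a - b - c + d) / 4)  # diagonal change
        for band, row in zip((approx, horiz, vert, diag), rows):
            band.append(row)
    return approx, horiz, vert, diag


def two_level_sequence(img):
    """Decompose twice and concatenate all subbands into a 1-D sequence."""
    a1, h1, v1, d1 = haar_level(img)
    a2, h2, v2, d2 = haar_level(a1)  # second level acts on the coarse band
    bands = [a2, h2, v2, d2, h1, v1, d1]  # ordering is an assumption
    return [value for band in bands for row in band for value in row]


flat = [[3.0] * 4 for _ in range(4)]
ramp = [[float(j) for j in range(4)] for _ in range(4)]
seq_flat = two_level_sequence(flat)
seq_ramp = two_level_sequence(ramp)
```

A constant image produces a single nonzero coefficient (the coarse average), while a horizontal ramp puts energy only in the row-direction detail bands. The sequence has the same number of entries as the input, so the wavelet stream costs the same sequence length as the spatial one.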

3.3 Language Modeling: Discrete Masked Diffusion

In text diffusion architectures such as DiffuApriel, G-Mamba encoders are adapted for discrete masking schedules. Input tokens are randomly replaced with [MASK] at variable noise levels; the encoder is conditioned on timestep embeddings via adaptive LayerNorm. The block structure remains bidirectional Mamba, optionally interleaved with transformer attention layers (hybrid DiffuApriel-H). The diffusion denoising process relies on the G-Mamba network to estimate the conditional token distribution, achieving substantial throughput gains and reduced perplexity compared to transformer-based masked diffusion LMs (Singh et al., 19 Nov 2025).
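The token-masking corruption can be sketched in a few lines. This assumes each token is masked independently with probability equal to the noise level `t` (a linear schedule); the schedule, the `mask_tokens` helper, and the example sentence are illustrative assumptions rather than DiffuApriel's exact procedure.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, t, rng):
    """Independently replace each token with MASK with probability t (0 <= t <= 1)."""
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
tokens = "the quick brown fox jumps".split()
clean = mask_tokens(tokens, 0.0, rng)    # noise level 0: nothing is masked
partial = mask_tokens(tokens, 0.5, rng)  # roughly half the tokens are masked
full = mask_tokens(tokens, 1.0, rng)     # noise level 1: everything is masked
```

Because unmasked positions keep their original tokens, the denoiser only has to predict the masked positions, and a bidirectional encoder can use context on both sides of each mask.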

4. Diffusion Denoising and Training Objectives

Across modalities, G-Mamba encoders serve as the core denoising model within either continuous (DDPM-style, flow-matching) or discrete (masking) diffusion frameworks.

- Forward process: For continuous data, the additive Gaussian noising process $q(x_t \mid x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I)$ is adopted; discrete settings instead use token-level Markov masking with a variable noise level (Mo, 2024, Singh et al., 19 Nov 2025).
- Reverse process: The denoiser $p_\theta(x_{t-1} \mid x_t)$ is parameterized in terms of noise estimation (continuous) or masked token prediction (discrete).
- Training objective: For continuous cases, mean squared error on predicted noise:

$\mathcal{L}_{\text{simple}} = \mathbb{E}_{t, x_0, \epsilon} \| \epsilon - \epsilon_\theta(x_t,t) \|^2$
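A small sketch of the continuous objective follows. It relies on the standard closed form implied by composing the Gaussian steps, $x_t = \sqrt{\bar\alpha_t}\, x_0 + \sqrt{1-\bar\alpha_t}\, \epsilon$ with $\bar\alpha_t = \prod_{s \le t} (1-\beta_s)$. The linear $\beta$ schedule, the sample values, and the helper names are assumptions for illustration, and the noise predictor is replaced by two trivial stand-ins.

```python
import math
import random


def alpha_bar(betas, t):
    """Cumulative product of (1 - beta_s) for s = 1..t (t = 0 means no noise)."""
    prod = 1.0
    for beta in betas[:t]:
        prod *= 1.0 - beta
    return prod


def noise_sample(x0, eps, betas, t):
    """Jump directly from x_0 to x_t using the closed-form marginal."""
    ab = alpha_bar(betas, t)
    return [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]


def simple_loss(eps, eps_pred):
    """Squared L2 distance between true and predicted noise for one sample."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred))


rng = random.Random(0)
betas = [1e-4 + (0.02 - 1e-4) * i / 999 for i in range(1000)]  # hypothetical schedule
x0 = [0.5, -1.0, 2.0]
eps = [rng.gauss(0.0, 1.0) for _ in x0]
x_t = noise_sample(x0, eps, betas, t=500)

loss_perfect = simple_loss(eps, eps)             # a perfect predictor
loss_zero = simple_loss(eps, [0.0] * len(eps))   # a predictor that always outputs 0
```

In training, `eps_pred` would come from the encoder evaluated on `x_t` and `t`, and the loss would be averaged over random draws of `t`, `x0`, and `eps`.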

For discrete masking, the loss is the reweighted cross-entropy over masked tokens:

$\mathcal{L}_{\text{MDM}} = \int_0^1 \frac{1}{t} \mathbb{E}_{q_{t|0}}\left[ \sum_{i: x_t^i = \texttt{[MASK]}} -\log p_\theta(x_0^i \mid x_t) \right] dt$
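The masked-diffusion loss can be approximated by Monte Carlo sampling over the noise level, as in the sketch below. It assumes each token is masked independently with probability $t$, truncates $t$ away from zero with a hypothetical `t_min` to avoid dividing by very small values, and uses toy stand-in predictors in place of the encoder; none of these choices are confirmed by the source.

```python
import math
import random

MASK = "[MASK]"


def mdm_loss_estimate(x0, predict, rng, n_samples=1000, t_min=1e-2):
    """Monte Carlo estimate of the 1/t-weighted masked cross-entropy.

    `predict(x_t)` must return one dict per position mapping token -> probability.
    """
    total = 0.0
    for _ in range(n_samples):
        t = rng.uniform(t_min, 1.0)
        x_t = [MASK if rng.random() < t else tok for tok in x0]
        probs = predict(x_t)
        # Only masked positions contribute to the loss.
        nll = sum(-math.log(probs[i][x0[i]])
                  for i, tok in enumerate(x_t) if tok == MASK)
        total += nll / t
    return total / n_samples


vocab = ["a", "b", "c", "d"]
x0 = ["a", "c", "b", "d", "a"]


def uniform(x_t):
    """Stand-in model that knows nothing: uniform over the vocabulary."""
    return [{tok: 1.0 / len(vocab) for tok in vocab} for _ in x_t]


def oracle(x_t):
    """Stand-in model that always puts probability 1 on the true token."""
    return [{tok: 1.0} for tok in x0]


loss_uniform = mdm_loss_estimate(x0, uniform, random.Random(0), n_samples=20000)
loss_oracle = mdm_loss_estimate(x0, oracle, random.Random(0), n_samples=100)
```

The $1/t$ weight cancels the expected number of masked tokens, which is proportional to $t$. As a result the uniform predictor's estimate should land near `len(x0) * log(len(vocab))` regardless of how the noise levels are drawn, which gives a quick sanity check for an implementation.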

Timestep and condition embeddings are injected using MLPs and/or AdaLN mechanisms.
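One common way to realize AdaLN conditioning is to let the condition embedding produce a per-feature scale and shift that modulate a parameter-free LayerNorm. The sketch below assumes the `(1 + scale)` form popularized by DiT-style models and plain linear maps from the condition vector; the exact parameterization used by the G-Mamba variants is not stated in the source, so every name and value here is an assumption.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def linear(weights, vec):
    """Matrix-vector product with `weights` given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def ada_layer_norm(x, cond, w_scale, w_shift):
    """Modulate LayerNorm(x) with a scale and shift computed from `cond`."""
    scale = linear(w_scale, cond)
    shift = linear(w_shift, cond)
    return [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]


x = [1.0, 2.0, 4.0]
cond = [0.5, -0.5]                     # hypothetical timestep/condition embedding
zeros = [[0.0, 0.0] for _ in x]
shift_only = [[4.0, 0.0] for _ in x]   # each feature gets shift 4.0 * 0.5 = 2.0

plain = ada_layer_norm(x, cond, zeros, zeros)
shifted = ada_layer_norm(x, cond, zeros, shift_only)
```

With zero modulation weights the layer reduces to an ordinary LayerNorm, so a network initialized this way starts out unconditioned and learns how strongly to react to the timestep.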

5. Complexity, Scalability, and Empirical Results

The defining property of G-Mamba encoders is $O(LD)$ per-block complexity, compared to the $O(L^2 D)$ of transformers. Reported empirical results are consistent with this:

3D Shape Generation (DiM-3D; Mo, 2024):
- DiM-3D-XL/2 reduces compute from 343.28 GFLOPs (DiT-3D-XL/2) to 294.58 GFLOPs at $256^3$ resolution.
- Maintains or surpasses baseline generation quality (e.g., Chair 1-NNA (CD): 45.78 for DiM-3D vs. 49.11 for DiT-3D-XL).
- Enables high-resolution modeling without out-of-memory (OOM) issues at large voxel grids.

Image Generation (DiMSUM; Phung et al., 2024):
- On ImageNet $256 \times 256$, DiMSUM (460M params) reaches FID=2.11 and Recall=0.59, compared to FID=2.27 and Recall=0.57 for DiT-XL/2 (675M).
- Notably faster convergence: 200–400 epochs for DiMSUM versus up to 1.4k for comparable DiT SDEs.
- Architectural hyperparameters include 20 blocks (16 DiM, 4 shared transformers), hidden dim $d=1024$, and batch sizes up to 704.

Language Modeling (DiffuApriel; Singh et al., 19 Nov 2025):
- At 1.3B scale, G-Mamba achieves $\sim 4.4\times$ inference throughput over transformer diffusion LMs and matches or outperforms them in validation perplexity (e.g., 20.17 vs 22.72 PPL).
- Throughput is stable for increasing sequence lengths due to linear scaling.
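The scaling contrast and the reported 3D compute figures can be checked with simple arithmetic. The sketch below counts only leading-order terms and ignores constant factors, so the cost functions are rough illustrations rather than FLOP models; the sequence length and width are hypothetical, while the two GFLOPs values are the ones quoted above.

```python
def ssm_block_cost(L, D):
    """Leading-order cost of a linear-time block: O(L * D)."""
    return L * D


def attention_block_cost(L, D):
    """Leading-order cost of full self-attention: O(L^2 * D)."""
    return L * L * D


def relative_reduction(baseline, improved):
    """Fraction by which `improved` is smaller than `baseline`."""
    return (baseline - improved) / baseline


# Hypothetical sequence length and width: the ratio of the two costs is L itself.
ratio = attention_block_cost(4096, 1024) / ssm_block_cost(4096, 1024)

# Reported figures: 343.28 GFLOPs (DiT-3D-XL/2) vs 294.58 GFLOPs (DiM-3D-XL/2).
saving = relative_reduction(343.28, 294.58)
```

The asymptotic ratio grows with sequence length, while the reported end-to-end saving at $256^3$ works out to about 14%. The gap between the two is expected, since whole-model FLOPs also include embeddings, MLPs, and projections that both architectures share.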

6. Comparative Insights and Hybrid Variants

G-Mamba encoders have been competitively benchmarked against DiT, DIFFUSSM, and large transformer diffusion baselines. Key observations include:

- G-Mamba can enable higher-resolution generative modeling, with comparable or improved qualitative and quantitative performance, and substantial resource savings.
- Hybrid variants (DiffuApriel-H, DiMSUM) interleave global attention or globally shared transformer blocks every $N$ G-Mamba layers. This reintroduces order-invariant mixing while confining the quadratic attention cost to a fraction of the layers: averaged over a group of $N$ layers, the per-layer cost is roughly $O(LD) + O(L^2 D / N)$, so the stack is cheaper than a pure transformer but no longer strictly linear in $L$.
- In multimodal domains, frequency and spatial information are most effectively fused at each stage, as shown by DiMSUM's cross-attention mechanism.
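The cost trade-off of interleaving attention can be sketched with a leading-order count. The function below assumes one full-attention layer per group of `attn_every` layers and ignores constant factors; the helper name and the sequence lengths are hypothetical, while the 20-layer, 16-plus-4 split mirrors the DiMSUM configuration quoted earlier.

```python
def hybrid_stack_cost(L, D, n_layers, attn_every):
    """Leading-order cost of a stack where every `attn_every`-th layer is attention."""
    n_attn = n_layers // attn_every
    n_ssm = n_layers - n_attn
    return n_ssm * L * D + n_attn * L * L * D


D = 1024
pure_ssm = hybrid_stack_cost(1024, D, n_layers=20, attn_every=21)  # no attention layers
hybrid = hybrid_stack_cost(1024, D, n_layers=20, attn_every=5)     # 16 SSM + 4 attention
pure_attn = hybrid_stack_cost(1024, D, n_layers=20, attn_every=1)  # all attention

# Doubling the sequence length shows which term dominates each stack.
growth_ssm = hybrid_stack_cost(2048, D, 20, 21) / pure_ssm
growth_hybrid = hybrid_stack_cost(2048, D, 20, 5) / hybrid
```

Under these assumptions the hybrid sits between the two pure stacks. Its attention layers still contribute a term quadratic in the sequence length, so doubling the length roughly quadruples the hybrid's leading-order cost at long sequences, whereas the pure SSM stack only doubles.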

7. Hyperparameters, Implementation, and Training Setup

Standard hyperparameters for G-Mamba diffusion encoders vary by modality:

Model	Depth (Blocks)	Hidden Dim.	Patch Size	Training Objective	Training Time/Epochs
DiM-3D-XL	36	1152	4	MSE (noise pred.)	2,000–10,000
DiMSUM-L/2	20 (16+4)	1024	2	Flow-Matching	225–510
DiffuApriel	24	1920	—	Reweighted CE	See (Singh et al., 19 Nov 2025)

Optimization typically relies on Adam with a learning rate of $1 \times 10^{-4}$, batch sizes up to 704 for vision, and dropout (e.g., 0.1–0.2). LayerNorm or AdaLN is used for timestep conditioning. For language, standard GPT-2 tokenization and masking schedules are employed.

In sum, the G-Mamba Diffusion Encoder realizes a scalable, efficient, and domain-general approach to conditional denoising diffusion modeling, leveraging bidirectional state-space models, optionally hybridized with global attention or spatial-frequency fusion, and establishes a consistent empirical advantage in speed and quality across text, image, and 3D shape domains (Mo, 2024, Phung et al., 2024, Singh et al., 19 Nov 2025).


References

Efficient 3D Shape Generation via Diffusion Mamba with Bidirectional SSMs (2024)

DiMSUM: Diffusion Mamba -- A Scalable and Unified Spatial-Frequency Method for Image Generation (2024)

Breaking the Bottleneck with DiffuApriel: High-Throughput Diffusion LMs with Mamba Backbone (2025)
