Gaussian Differentiable Reordering (GDR)

Updated 6 December 2025

Gaussian-based Differentiable Reordering (GDR) is a learnable mechanism that uses Gaussian kernels for continuous relaxation of sorting, preserving local structure in point cloud data.
It integrates offset prediction with Gaussian-based KNN resampling to dynamically reorder tokens, ensuring spatial and feature coherence in neural models like Mamba.
Empirical evaluations show that optimal Gaussian softness in GDR improves classification accuracy by up to 1.6% over static reordering methods in point cloud tasks.

Gaussian-based Differentiable Reordering (GDR) is a mechanism designed to enable end-to-end learnable and differentiable token or point ordering, particularly within point cloud neural architectures that rely on ordered sequences for processing. By applying a Gaussian kernel to weight the assignment of features to discrete indices, GDR realizes a continuous relaxation of sorting that is fully differentiable and jointly trainable with other neural modules. This approach is especially beneficial for models such as State Space Models (SSMs), including Mamba, that process long input sequences but are fundamentally sensitive to input order—a structural mismatch for permutation-invariant point cloud data. GDR addresses this by adaptively reordering point or token sequences so that spatially or feature-wise similar entities are close in the serialized input, thus preserving spatial locality and improving downstream model performance (Liu et al., 3 Dec 2025).

1. Problem Motivation and Scope

Point clouds possess inherent permutation invariance and spatial irregularity, properties that challenge architectures which depend on fixed or pre-defined token serialization. Traditional scan orders—such as Hilbert or Z-curve mappings—are static, unable to capture local structural context or adapt to underlying data geometry, often resulting in distant serialization of spatially adjacent points. GDR resolves this mismatch by providing a gradient-based, learnable reordering policy. The central objective is to optimize the sequence such that local structure is preserved, enhancing feature coherence for models like deformable Mamba SSMs. This strategy is strictly designed for scenarios where serialization impacts downstream reasoning efficacy, such as classification, segmentation, and few-shot learning in point cloud tasks (Liu et al., 3 Dec 2025).

2. Mathematical Foundation

The GDR mechanism is grounded in canonical Gaussian weighting and soft aggregation. For a sequence of $N$ tokens (or points), with feature set $F \in \mathbb{R}^{N \times D}$ , base index vector $I \in \mathbb{R}^N$ ( $I_i = i$ ), and learned sequential offsets $\Delta t \in \mathbb{R}^N$ , the shifted index vector is

$s_i = I_i + \Delta t_i.$

The target index vector is $J = [1,2,...,N]$ . For each token $i$ , Gaussian-weighted assignment weights are computed:

$W(d;\sigma) = \exp(-d^2 / (2\sigma^2))$

where $d = s_i - J_j$ , and $\sigma_t$ controls the sorting relaxation. Corresponding normalized weights

$\alpha_{i,j} = \frac{W(s_i - J_j; \sigma_t)}{\sum_{l=1}^N W(s_i - J_l; \sigma_t) + \epsilon}$

yield the reordered features by soft assignment

$f'_i^{(t)} = \sum_{j=1}^N \alpha_{i,j} F'_j^{(s)}$

where $F'_j^{(s)}$ are locally resampled features (via Gaussian-based KNN Resampling, GKR). The parameter $\sigma_t$ modulates the softness: $\sigma_t \to 0^+$ approaches hard sorting, while large $\sigma_t$ collapses weights to uniform broadcasting (i.e., average pooling) (Liu et al., 3 Dec 2025).

3. Algorithmic Realization

The GDR procedure is realized in conjunction with Gaussian-based resampling as follows:

Offset Prediction: Contextual and global features are concatenated and projected through a depthwise separable convolution, channel attention, and conv1×1 layers. The output is split into spatial ( $\Delta p$ ) and sequential ( $\Delta t$ ) offsets, bounded by $\tanh$ for stability.
Local Resampling (GKR): Each point is spatially translated by $\Delta p$ , then $K_r$ nearest neighbors are identified and interpolated via normalized Gaussian weights, producing deformed, resampled features.
Differentiable Reordering (GDR): Shifted indices $s$ are computed, and Gaussian weights are assigned for reordering via the mechanism above, generating the final ordered feature set $F_{\text{out}}$ for SSM input.
Integration with SSM: $F_{\text{out}}$ is used as the input sequence for a deformable SSM such as Mamba.

The entire module is trained jointly via standard task losses, with gradients naturally flowing through all stages to both spatial and sequential offsets. No additional ordering-specific supervision is required. Continuous relaxation ensures differentiability and stable optimization (Liu et al., 3 Dec 2025).

4. Computational Complexity

GDR introduces moderate overhead relative to base model costs:

Stage	Time Complexity	Memory Complexity
Standard Serialization	$\mathcal{O}(ND)$	$\mathcal{O}(ND)$
Local Resampling (GKR)	KNN: $\mathcal{O}(N \log N)$ or $\mathcal{O}(N^2)$ (worst), weight/interpolation: $\mathcal{O}(N K_r D)$	$\mathcal{O}(N K_r)$
GDR	Weight matrix: $\mathcal{O}(N^2)$ , weighted sum: $\mathcal{O}(N^2 D)$	$\mathcal{O}(N^2)$

For typical $N = 128$ and $K_r = 3$ , overhead is marginal compared to the cost of SSM branches. Spatial indexing and batch-masking optimize KNN computation (Liu et al., 3 Dec 2025).

5. Empirical Effectiveness

Quantitative ablation studies on ScanObjectNN (OBJ_ONLY / PB_T50_RS) demonstrate measurable performance improvements:

Reordering Method	OBJ_ONLY (%)	PB_T50_RS (%)
Fixed Hilbert order	89.93	88.12
Non-differentiable sorting	90.81	89.01
GDR ( $\sigma_t=0.1$ )	91.35	89.32
GDR ( $\sigma_t=0.2$ )	91.56	89.40
GDR ( $\sigma_t=0.4$ )	91.23	89.29
GDR ( $\sigma_t=0.6$ )	91.12	89.04

Optimal Gaussian softness ( $\sigma_t=0.2$ ) yields the best trade-off between smoothness and gradient flow. Exclusion of GDR produces a 1.3–1.6% accuracy decline, confirming its impact on local continuity and overall performance in point cloud tasks (Liu et al., 3 Dec 2025).

6. Implementation Details and Hyperparameter Settings

Critical implementation choices include:

Resampling Neighborhood Size: $K_r = 3$ for best efficiency.
Gaussian Scales: $\sigma_s = 1$ (GKR), $\sigma_t = 0.2$ (GDR).
Offset Bounds: $\Delta p, \Delta t$ bounded to $[-1,1]$ via $\tanh$ .
Normalization Constant: $\epsilon = 1 \times 10^{-6}$ .
Batch Masking: Ensures neighborhood selection within each sample.
Offset Network Structure: Depthwise separable conv, channel attention, conv1×1 per Section 3.3 (Liu et al., 3 Dec 2025).
End-to-End Training: AdamW, cosine learning rate schedule, cross-entropy loss; no need for separate pre-training of GDR.

A plausible implication is that such hyperparameter selections affect feature smoothness, optimization stability, and the model’s ability to adapt serialization to varied point cloud geometries.

GDR, as realized in DM3D (Liu et al., 3 Dec 2025), is distinct from earlier permutation learning techniques including Gumbel-Sinkhorn, low-rank matrix factorization, and continuous relaxations such as SoftSort or ShuffleSoftSort (Barthel et al., 17 Mar 2025). Whereas prior methods often require $N^2$ parameters or fail to generalize to high-dimensional/irregular layouts, GDR leverages Gaussian kernels for both local resampling and global token reordering in a unified, differentiable process. No explicit per-token Gaussian parameterization $(\mu_i, \sigma_i^2)$ is used in GDR; all weights are computed on-the-fly from offsets, supporting end-to-end learning.

A common misconception—stemming from terminology overlap—is that all “Gaussian-based reordering” techniques represent each permutation position by a corresponding Gaussian distribution. In fact, GDR uses learned point shifts and index offsets, not per-token Gaussian parameters, to construct its kernel weights.

This suggests a broader utility for GDR as a general-purpose differentiable serialization primitive in domains where input sequence impacts spatial reasoning and continuity, but static or hard approaches are insufficient. Nonetheless, GDR’s quadratic memory scaling may limit direct deployment to very large $N$ unless block-wise or approximate kernel techniques are adopted.