
Smoothed Discrete Sampling (SDS)

Updated 4 February 2026
  • Smoothed Discrete Sampling (SDS) is a framework that smooths discrete data manifolds with Gaussian noise, facilitating continuous optimization and efficient sampling.
  • It combines score matching, contrastive divergence, and MCMC techniques to robustly navigate multimodal discrete spaces in applications like protein sequence generation and text-to-3D synthesis.
  • SDS enhances training stability, variance reduction, and sample diversity, outperforming autoregressive and discrete diffusion models with simpler noise scale management.

Smoothed Discrete Sampling (SDS) is a general framework for generative modeling in which a discrete data manifold is smoothed via additive Gaussian noise, enabling continuous optimization, sampling, and denoising strategies. SDS forms the basis of several influential algorithms in both discrete sequence modeling and text-conditioned 3D synthesis, including Discrete Walk-Jump Sampling (dWJS) for protein sequences and the score distillation losses used in DreamFusion-style text-to-3D pipelines. The mathematical principles underlying SDS draw from energy-based modeling, diffusion processes, and scale-space theory, with key methodological distinctions relative to both autoregressive and multi-scale diffusion approaches. This article surveys the foundations, theoretical properties, and empirical performance of SDS in discrete domains, highlighting advances in gradient guidance, denoising, and variance reduction.

1. Formal Definition and Mathematical Framework

Let $x \in \{0,1\}^d$ denote a sample from a discrete manifold $\mathcal{M}$, such as a protein sequence (one-hot representation) or a rendered image generated from a 3D scene parameterization. Smoothed Discrete Sampling (SDS) proceeds by perturbing $x$ with isotropic Gaussian noise of scale $\sigma$,

y = x + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma^2 I_d),

resulting in the smoothed data density

p_\sigma(y) = \sum_{x\in\mathcal{M}} p_{\text{data}}(x)\, \frac{1}{(2\pi\sigma^2)^{d/2}} \exp\left(-\frac{1}{2\sigma^2} \|y-x\|^2\right).

The associated smoothed energy function is $E_\sigma(y) = -\log p_\sigma(y)$, with gradient (the negative of the "score")

\nabla_y E_\sigma(y) = -\nabla_y \log p_\sigma(y) = \mathbb{E}_{x\sim p(\cdot\mid y)}\left[ \frac{y-x}{\sigma^2} \right].

This continuous relaxation enables efficient optimization and Markov chain Monte Carlo (MCMC) sampling on the smoothed manifold, followed by a projection (denoising) "jump" step back to the original discrete space. The score-matching and contrastive divergence objectives can be used to train parametric denoisers $g_\phi$ and energy-based models $f_\theta$ directly on $y$ sampled from $p_\sigma$ (Frey et al., 2023).
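For a small discrete set, the smoothed score can be computed in closed form by averaging over the posterior $p(x \mid y)$. The sketch below is illustrative only (it assumes a uniform $p_{\text{data}}$ over two one-hot states; the name `smoothed_score` is ours, not from the paper's code):

```python
# Illustrative sketch (not from the paper's code): exact smoothed score for a
# toy discrete "manifold" of two one-hot states, assuming uniform p_data.
import numpy as np

def smoothed_score(y, manifold, sigma):
    """Return grad_y log p_sigma(y) via the posterior mean E[x | y] (Tweedie)."""
    # Posterior weights p(x | y) are proportional to p_data(x) * N(y; x, sigma^2 I).
    d2 = ((manifold - y) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma**2))  # shifted for numerical stability
    w /= w.sum()
    post_mean = w @ manifold                       # E[x | y]
    return (post_mean - y) / sigma**2              # note: grad E_sigma(y) is the negative of this

M = np.array([[1.0, 0.0], [0.0, 1.0]])             # two one-hot "sequences"
y = np.array([0.9, 0.2])                           # a noisy observation near the first mode
s = smoothed_score(y, M, sigma=0.5)                # points toward [1, 0]
```

Because $y$ lies near the first mode, the score pushes its first coordinate up and its second coordinate down, i.e. toward the nearest discrete configuration.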

2. The Discrete Walk-Jump Sampling Algorithm

The Discrete Walk-Jump Sampling (dWJS) algorithm provides a practical realization of SDS in discrete generative modeling. The essential steps are:

  • Walk (Langevin MCMC on smoothed manifold):

y_{t+1} = y_t - \eta\, \nabla_y E_\sigma(y_t) + \sqrt{2\eta}\,\xi_t, \quad \xi_t \sim \mathcal{N}(0, I_d),

where the gradient may be replaced by a learned denoiser $g_\phi(y_t)$ or an EBM score $\nabla_y f_\theta(y_t)$.

  • Jump (One-step Denoising):

\hat{x} = J(y_T) = \arg\max_{x\in\mathcal{M}} p(x \mid y_T) = \left\lfloor y_T + \sigma^2\, g_\phi(y_T) \right\rceil_{\text{one-hot}},

projecting $y_T$ back to the nearest valid discrete configuration.

  • Training:

    • Score matching: Trains $g_\phi$ by least-squares denoising,

    \mathcal{L}_{\text{SM}}(\phi) = \mathbb{E}_{x,\varepsilon}\left[\, \| x - (y + \sigma^2 g_\phi(y)) \|^2 \,\right], \quad y = x + \varepsilon.

    • Contrastive divergence: Trains $f_\theta$ to maximize likelihood on smoothed data and discriminate against negative samples obtained via short-run Langevin dynamics.

Pseudocode for the method is given in (Frey et al., 2023), with core quantities summarized in the following table:

| Step             | Symbolic Formulation                                               | Description               |
|------------------|--------------------------------------------------------------------|---------------------------|
| Smoothing        | $y = x + \varepsilon$                                              | Add Gaussian noise        |
| Langevin walk    | $y_{t+1} = y_t - \eta\, \nabla E_\sigma + \cdots$                  | MCMC on smoothed manifold |
| Denoising (jump) | $\hat{x} = \text{one-hot}\left[ y_T + \sigma^2 g_\phi(y_T) \right]$ | Project to discrete space |
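The walk and jump steps above can be sketched end to end. In this toy, the exact posterior-mean denoiser for a uniform mixture over four one-hot states stands in for a trained network $g_\phi$; all names and hyperparameter values are illustrative assumptions, not the paper's implementation:

```python
# Toy end-to-end walk-jump sketch. The "denoiser" g_phi below is the exact
# posterior-mean score for a uniform mixture over four one-hot states,
# standing in for a trained network.
import numpy as np

rng = np.random.default_rng(0)
M = np.eye(4)                  # discrete manifold: four one-hot states
sigma, eta, T = 0.5, 1e-2, 200

def g_phi(y):
    """Denoiser field such that y + sigma^2 * g_phi(y) = E[x | y]."""
    d2 = ((M - y) ** 2).sum(axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma**2))
    w /= w.sum()
    return (w @ M - y) / sigma**2

# Walk: Langevin MCMC on the smoothed manifold (-grad E_sigma = g_phi here).
y = rng.normal(0.25, sigma, size=4)
for _ in range(T):
    y = y + eta * g_phi(y) + np.sqrt(2 * eta) * rng.normal(size=4)

# Jump: one-step denoising, then project to the nearest one-hot configuration.
x_hat = np.eye(4)[np.argmax(y + sigma**2 * g_phi(y))]
```

The walk explores the smoothed density; the final jump is a single denoising step followed by an argmax projection, so `x_hat` is always a valid one-hot vector.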

3. Properties, Stability, and Theoretical Insights

The single-scale smoothing in SDS is instrumental in preventing instabilities common in EBM training, such as energy blow-up, and obviates the need for replay buffers or annealing. Empirical observations indicate stable training and sampling across $\sigma \in [0.5, 4.0]$, with instability only for very small $\sigma$ (undersmoothing regime) (Frey et al., 2023).

  • Noise scale ($\sigma$): Must be large enough to smooth out discrete energy ridges but not so large as to erase multi-modal structure. Typical values are $\sigma \sim 0.5$ (proteins).
  • Mixing: dWJS mixes rapidly across distant modes and produces high-quality, diverse samples, outperforming diffusion and autoregressive baselines (10–100× and 40× faster, respectively, for protein generation).
  • Score matching: Only a single noise scale is needed for training, aligning with Neural Empirical Bayes (Frey et al., 2023).

Training and sampling remain robust provided $\eta \|\nabla f\|^2 \ll 1$ and moderate $T$ (tens to hundreds of Langevin steps).
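The step-size condition can be illustrated on the simplest possible smoothed energy, a quadratic $E(y) = y^2/(2\sigma^2)$, for which the Langevin discretization has update factor $(1 - \eta/\sigma^2)$ and is stable only when $\eta < 2\sigma^2$. The 1-D setup below is our toy, not the paper's EBM:

```python
# Illustrative 1-D check of the step-size condition: Langevin on the
# quadratic energy E(y) = y^2 / (2 sigma^2) is stable only if eta < 2 sigma^2.
import numpy as np

def langevin(eta, sigma=0.5, T=500, seed=0):
    rng = np.random.default_rng(seed)
    y = 5.0
    for _ in range(T):
        y = y - eta * (y / sigma**2) + np.sqrt(2 * eta) * rng.normal()
    return y

y_stable = langevin(eta=0.02)    # eta << 2 sigma^2 = 0.5: chain stays bounded
y_unstable = langevin(eta=0.75)  # eta > 2 sigma^2: iterates blow up geometrically
```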

4. Comparisons to Other Generative and Diffusive Frameworks

SDS occupies a distinct space in the taxonomy of generative models:

  • Autoregressive models require sequential sampling, incurring slow inference and exposure bias.
  • Discrete diffusion models prescribe multi-scale noise schedules with often hundreds of noising and denoising iterations, brittle schedule design, and slow sampling.
  • DEEN (Deep Energy Estimator Networks) parameterize scores but lack explicit MCMC mechanisms or the mixing properties conferred by smoothing.

SDS/dWJS combines the flexible sampling properties of energy-based models (via MCMC) with the stability and sample quality of score-based models, requiring only a single noise scale $\sigma$ (Frey et al., 2023).

5. Discretization and Scale-Space Axioms in Gaussian Smoothing for SDS

The implementation of the Smoothing step in SDS requires careful consideration of Gaussian kernel discretization, especially when the smoothed manifold is derived from pixelated images or other grid-structured data. Three principal strategies are distinguished (Lindeberg, 2023):

| Discretization Method             | Key Strengths                                                            | Pitfalls                                                          |
|-----------------------------------|--------------------------------------------------------------------------|-------------------------------------------------------------------|
| Sampling approach                 | Simplicity; direct DL-framework implementation                           | Not normalized at fine scales; breaks cascade for small $\sigma$  |
| Integrated (pixel-integral)       | Faithfully models pixel averaging; correct spatial average               | Constant scale offset; breaks cascade property                    |
| Discrete analogue (Bessel kernel) | Satisfies scale-space axioms: exact cascade, normalization, monotonicity | Needs Bessel-function implementation                              |
  • For fine scales ($\sigma \lesssim 1$) or robust scale-space properties, the discrete-analogue method best preserves theoretical guarantees.
  • For moderate to coarse scales ($\sigma > 1$), all methods can suffice, with sampling offering simplicity and pixel-integral kernels better modeling real-world sensor integration.

A key point: SDS emphasizes that digital data (e.g., images) are averages over finite pixel supports, not point samples (Lindeberg, 2023).
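A small experiment makes the fine-scale normalization pitfall concrete for two of the three strategies. This comparison is our sketch, not Lindeberg's reference code; the discrete analogue is omitted because it requires modified Bessel functions (e.g. `scipy.special.ive`):

```python
# Contrast the sampled and pixel-integrated discretizations of a 1-D
# Gaussian kernel at a fine scale (sigma = 0.3).
import math

def sampled_kernel(sigma, radius=10):
    """Point-sample the continuous Gaussian density at integer positions."""
    return [math.exp(-n * n / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)
            for n in range(-radius, radius + 1)]

def integrated_kernel(sigma, radius=10):
    """Integrate the Gaussian over each unit pixel [n - 1/2, n + 1/2]."""
    cdf = lambda t: 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))
    return [cdf(n + 0.5) - cdf(n - 0.5) for n in range(-radius, radius + 1)]

# At sigma = 0.3 the sampled kernel sums to about 1.34 (not normalized),
# while the integrated kernel telescopes to essentially exactly 1.
s_sum = sum(sampled_kernel(sigma=0.3))
i_sum = sum(integrated_kernel(sigma=0.3))
```

The integrated kernel also directly encodes the point that digital pixels are averages over finite supports rather than point samples.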

6. SDS in Text-to-3D Synthesis and Gradient Variance Reduction

In text-to-3D pipelines such as DreamFusion, SDS provides image-space guidance from frozen 2D diffusion models for 3D representation optimization. The standard SDS guidance gradient is (Lukoianov et al., 2024):

\nabla_\psi L_{\text{SDS}} = \mathbb{E}_{t, \varepsilon, c}\left[ \sigma(t)\,\big(\epsilon_t(x_t, y) - \varepsilon\big)\, \frac{\partial g}{\partial \psi} \right],

where $g(\psi, c)$ is the rendered image from 3D parameters $\psi$ at a random camera $c$.

Recent analysis demonstrates that SDS is a high-variance discretization of DDIM, as SDS samples i.i.d. noise at each step rather than tracking prompt-conditioned trajectories as in DDIM. This mismatch creates excessive update variance, leading to over-smoothed, cartoon-like outputs in 3D. The Score Distillation via Inversion (SDI) approach replaces i.i.d. noise with a prompt- and trajectory-matched estimate via DDIM inversion, restoring sample variance to the theoretical minimum and dramatically improving texture fidelity and detail (Lukoianov et al., 2024).
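The role of the sampled noise $\varepsilon$ can be seen in a one-pixel toy: with the exact noise predictor for a point target, the $\varepsilon$ terms cancel out of $(\epsilon_t - \varepsilon)$ and the SDS gradient is deterministic; with a learned predictor they do not cancel, and that residual randomness is the variance SDI's inversion removes. Everything below (the identity renderer, the names `eps_pred` and `mu`, the DDPM-style parameterization) is an illustrative assumption:

```python
# One-pixel SDS toy: renderer g(psi) = psi, frozen model prefers target mu.
import numpy as np

rng = np.random.default_rng(0)
mu = 1.0    # the image the diffusion model "prefers" for the prompt
psi = 0.2   # current scene parameter

def sds_grad(psi, alpha_bar):
    a, s = np.sqrt(alpha_bar), np.sqrt(1.0 - alpha_bar)
    eps = rng.normal()
    x_t = a * psi + s * eps            # noised rendering
    eps_pred = (x_t - a * mu) / s      # exact predictor for target delta(mu)
    dg_dpsi = 1.0                      # derivative of the identity renderer
    return (eps_pred - eps) * dg_dpsi  # weighting w(t) omitted for clarity

grads = [sds_grad(psi, alpha_bar=0.5) for _ in range(100)]
# with the exact predictor, every sample equals
# sqrt(alpha_bar / (1 - alpha_bar)) * (psi - mu) = -0.8
```

The gradient consistently pulls $\psi$ toward $\mu$; an imperfect $\epsilon$-predictor would reintroduce the sampled noise into each update.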

Validation experiments on 3D shape generation show SDI achieves superior CLIP and visual quality scores compared to SDS and other methods (Lukoianov et al., 2024).

7. Guidelines, Hyperparameters, and Practical Recommendations

  • Noise scale ($\sigma$): Select to achieve the desired smoothing tradeoff; $\sigma \sim 0.5$ for protein sequences, $\sigma > 1$ for robust image smoothing.
  • Number of steps ($T$): Typically 10–200 for adequate manifold exploration.
  • Step size ($\eta$): $\eta \sim 10^{-2}$ to $10^{-3}$, tuned for discretization stability.
  • Denoising projection: Use model-based denoisers or nearest-neighbor projection as appropriate for target manifold.
  • Manifold matching: For continuous domains (pixelated images), pixel-integral or discrete-analogue Gaussian convolutions yield better physical and mathematical fidelity (Lindeberg, 2023).
  • Variance reduction: For text-to-3D or DDIM-style workflows, replace random noise with trajectory- and prompt-conditioned inversion to improve detail preservation (Lukoianov et al., 2024).

SDS and its algorithmic realizations offer a robust, efficient, and theoretically sound paradigm for discrete generative modeling, with demonstrated advantages in mixing, sample quality, and practical simplicity relative to alternative approaches (Frey et al., 2023, Lindeberg, 2023, Lukoianov et al., 2024).
