Sequential Variational Lower Bound
- Sequential variational lower bound is a method that extends the classical ELBO by explicitly modeling temporal dependencies in sequential data.
- It leverages techniques such as FIVO and VSMC, using particle filtering to tighten the bound on the marginal likelihood and improve inference.
- The approach supports robust, scalable optimization and is central to applications such as time series analysis, speech, video, and deep sequential modeling.
A sequential variational lower bound is a variational objective formulated to provide a tractable, often tighter, lower bound on the marginal likelihood (evidence) for models involving sequential or temporally structured data. It generalizes the classical evidence lower bound (ELBO) used in static latent variable models by explicitly integrating the temporal or structural dependencies that arise in sequential data, such as time series, state-space models, and recurrent neural architectures. The sequential variational lower bound is fundamental for probabilistic learning and inference in modern machine learning, enabling scalable optimization and improved posterior approximation in challenging sequential tasks.
1. Foundations and Definitions
The classical ELBO for a single data point $x$ and latent variable $z$ is

$$\log p_\theta(x) \;\ge\; \mathcal{L}_{\text{ELBO}}(\theta,\phi;x) \;=\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right],$$

where $q_\phi(z \mid x)$ is the variational posterior.
In sequential latent variable models with observed data $x_{1:T}$ and latent variables $z_{1:T}$, the joint distribution typically factorizes as

$$p_\theta(x_{1:T}, z_{1:T}) \;=\; \prod_{t=1}^{T} p_\theta(x_t \mid x_{1:t-1}, z_{1:t})\, p_\theta(z_t \mid x_{1:t-1}, z_{1:t-1}).$$
A sequential variational lower bound extends the ELBO by leveraging this sequential structure:

$$\log p_\theta(x_{1:T}) \;\ge\; \mathbb{E}_{q_\phi(z_{1:T}\mid x_{1:T})}\!\left[\sum_{t=1}^{T} \log \frac{p_\theta(x_t \mid x_{1:t-1}, z_{1:t})\, p_\theta(z_t \mid x_{1:t-1}, z_{1:t-1})}{q_\phi(z_t \mid z_{1:t-1}, \cdot)}\right],$$

where the variational posterior factorizes as $q_\phi(z_{1:T}\mid x_{1:T}) = \prod_t q_\phi(z_t \mid z_{1:t-1}, \cdot)$. The conditioning set in $q_\phi(z_t \mid z_{1:t-1}, \cdot)$ encapsulates possible choices, such as filtering (conditioning on the past, $x_{1:t}$) or smoothing (conditioning on all data, $x_{1:T}$), and is central to the approximation fidelity and downstream performance (Bayer et al., 2021).
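To make the objective concrete, the following minimal NumPy sketch estimates such a sequential lower bound by Monte Carlo for a toy linear-Gaussian state-space model with a heuristic filtering-style proposal. The model, its parameters, and the proposal are illustrative assumptions, not taken from any of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    """Log density of a univariate Gaussian N(mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy linear-Gaussian state-space model (illustrative parameters):
#   z_t ~ N(a * z_{t-1}, q),   x_t ~ N(z_t, r)
a, q, r = 0.9, 0.5, 1.0

def sequential_elbo(x, n_samples=64):
    """Monte Carlo estimate of a sequential ELBO with a heuristic
    filtering-style proposal q(z_t | z_{t-1}, x_t)."""
    elbo = np.zeros(n_samples)
    z_prev = np.zeros(n_samples)
    for t in range(len(x)):
        prior_mean = a * z_prev
        # Illustrative proposal: average the prior mean and the observation.
        prop_mean, prop_var = 0.5 * (prior_mean + x[t]), 0.5
        z = prop_mean + np.sqrt(prop_var) * rng.standard_normal(n_samples)
        elbo += (log_normal(z, prior_mean, q)           # log p(z_t | z_{t-1})
                 + log_normal(x[t], z, r)               # log p(x_t | z_t)
                 - log_normal(z, prop_mean, prop_var))  # - log q(z_t | .)
        z_prev = z
    return elbo.mean()

x = rng.standard_normal(25)  # stand-in observation sequence
print("sequential ELBO estimate:", sequential_elbo(x))
```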
2. Variational Sequential Monte Carlo and Particle Filtering Objectives
Sequential variational lower bounds can be constructed using Sequential Monte Carlo (SMC) methods, yielding objectives such as the Filtering Variational Objective (FIVO) (Maddison et al., 2017) and the Variational Sequential Monte Carlo (VSMC) bound (Naesseth et al., 2017). Key elements are:
- The SMC estimate of the marginal likelihood at time $T$:

  $$\hat{p}_N(x_{1:T}) \;=\; \prod_{t=1}^{T} \left(\frac{1}{N} \sum_{i=1}^{N} w_t^{i}\right),$$

  where $w_t^{i}$ are the unnormalized importance weights of the $N$ particles at step $t$.
- The FIVO bound:

  $$\mathcal{L}_{\text{FIVO}}(N) \;=\; \mathbb{E}\!\left[\log \hat{p}_N(x_{1:T})\right] \;\le\; \log p(x_{1:T}),$$

  where the expectation is over the randomness of the SMC procedure (proposal sampling and resampling).
- Tightness of the bound depends on the variance of the estimator; particle filters (with resampling) yield a relative variance that scales linearly, rather than exponentially, with sequence length (Maddison et al., 2017).
- FIVO generalizes the ELBO: with a single particle and no resampling, the FIVO reduces to the ELBO; with more particles and resampling, the bound is tightened.
- VSMC optimizes the proposal parameters $\lambda$ via stochastic gradient ascent on the surrogate variational lower bound

  $$\mathcal{L}_{\text{VSMC}}(\lambda) \;=\; \mathbb{E}\!\left[\log \hat{p}_N(x_{1:T})\right],$$

  where the expectation is over the SMC randomness and the proposal distributions are parameterized by $\lambda$.
This sequential bound can be made arbitrarily tight by increasing the number of particles (Naesseth et al., 2017). A minimal particle-filter sketch of this estimator is given below.
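The sketch below computes $\log \hat{p}_N(x_{1:T})$ for the same illustrative toy model used above, using a bootstrap particle filter (transition prior as proposal, multinomial resampling); averaging the result over repeated runs gives a Monte Carlo estimate of the FIVO/VSMC objective. In practice the proposals are learned and gradients are propagated through the estimator, which this sketch does not attempt.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal(x, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Same illustrative toy model: z_t ~ N(a * z_{t-1}, q), x_t ~ N(z_t, r).
a, q, r = 0.9, 0.5, 1.0

def log_p_hat_smc(x, n_particles=16):
    """log p_hat_N(x_{1:T}) from a bootstrap particle filter with
    multinomial resampling (proposal = transition prior)."""
    z = np.zeros(n_particles)
    log_p_hat = 0.0
    for t in range(len(x)):
        z = a * z + np.sqrt(q) * rng.standard_normal(n_particles)  # propose
        log_w = log_normal(x[t], z, r)                              # weight
        # Accumulate log (1/N) sum_i w_t^i with a stable log-sum-exp.
        m = log_w.max()
        log_p_hat += m + np.log(np.mean(np.exp(log_w - m)))
        # Multinomial resampling keeps the weight variance from compounding.
        probs = np.exp(log_w - m)
        z = rng.choice(z, size=n_particles, p=probs / probs.sum())
    return log_p_hat

x = rng.standard_normal(25)
runs = [log_p_hat_smc(x) for _ in range(100)]
print("FIVO-style estimate (mean over runs):", np.mean(runs))
```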
3. Regularization, Robustness, and Alternative Lower Bounds
Sequential variational lower bounds are extensible and robustified in various formulations:
- Robust ELBO: Additive $\varepsilon$-shifting inside the log of the evidence leads to the robust evidence $\log\!\left(p_\theta(x) + \varepsilon\right)$, with the robust bound

  $$\log\!\left(p_\theta(x) + \varepsilon\right) \;\ge\; \mathbb{E}_{q_\phi(z\mid x)}\!\left[\log\!\left(\frac{p_\theta(x,z)}{q_\phi(z\mid x)} + \varepsilon\right)\right].$$

  This construction downweights corrupted or noisy sequences: if the likelihood of a sequence is much smaller than $\varepsilon$, the sample's contribution to the gradient vanishes. Dynamically setting $\varepsilon$ from the average ELBO further regularizes training, nearly eliminating the influence of outlier or irrelevant trajectories (Figurnov et al., 2016). A numerical sketch of this gating effect follows this list.
- Thermodynamic Variational Objectives (TVOs): The marginal log-likelihood is represented via thermodynamic integration as

  $$\log p_\theta(x) \;=\; \int_0^1 \mathbb{E}_{\pi_\beta}\!\left[\log \frac{p_\theta(x,z)}{q_\phi(z\mid x)}\right] d\beta, \qquad \pi_\beta(z) \;\propto\; q_\phi(z\mid x)^{1-\beta}\, p_\theta(x,z)^{\beta}.$$

  The TVO lower bound approximates this integral by a left Riemann sum over the intermediate distributions $\pi_\beta$, yielding a sequential tightening of the lower bound as the partition is refined. The total discretization gap is a sum of KL divergences between adjacent $\beta$-indexed geometric mixtures of $q_\phi$ and $p_\theta$ (Masrani et al., 2019, Brekelmans et al., 2020). A Riemann-sum sketch also follows this list.
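The gating effect of the $\varepsilon$-shift can be illustrated numerically. In the sketch below, the per-sample log importance ratios and the value of $\varepsilon$ are fabricated for illustration; the point is only that terms with $p_\theta(x,z)/q_\phi(z\mid x) \ll \varepsilon$ receive a near-zero gradient weight.

```python
import numpy as np

rng = np.random.default_rng(0)

def robust_elbo(log_w, eps):
    """E_q[log(p(x,z)/q(z|x) + eps)] estimated from per-sample
    log importance ratios log_w."""
    return np.mean(np.log(np.exp(log_w) + eps))

# Fabricated log importance ratios for a "clean" and a "corrupted" sequence.
log_w_clean = rng.normal(loc=-5.0, scale=1.0, size=256)
log_w_noisy = rng.normal(loc=-60.0, scale=1.0, size=256)

eps = np.exp(-20.0)  # illustrative shift
for name, log_w in [("clean", log_w_clean), ("noisy", log_w_noisy)]:
    # d log(w + eps) / d log w = w / (w + eps): close to 1 for clean samples,
    # close to 0 when w << eps, so corrupted sequences barely move the gradient.
    gate = np.exp(log_w) / (np.exp(log_w) + eps)
    print(f"{name}: robust ELBO = {robust_elbo(log_w, eps):.2f}, "
          f"mean gradient gate = {gate.mean():.3f}")
```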
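Likewise, the TVO lower bound can be sketched as a left Riemann sum over a $\beta$ grid, with each per-$\beta$ expectation estimated by self-normalized importance sampling from $q_\phi$. The log-ratio samples below are fabricated for illustration; a single partition recovers the standard ELBO, and finer partitions tighten the bound.

```python
import numpy as np

rng = np.random.default_rng(0)

def tvo_lower_bound(log_w, betas):
    """Left Riemann sum over E_{pi_beta}[log w], each term estimated by
    self-normalized importance sampling with weights proportional to w**beta
    (log_w = log p(x,z) - log q(z|x) for samples z ~ q)."""
    moments = []
    for b in betas:
        sw = np.exp(b * log_w - np.max(b * log_w))  # weights ~ w^beta
        sw /= sw.sum()
        moments.append(np.sum(sw * log_w))          # E_{pi_beta}[log w]
    widths = np.diff(np.append(betas, 1.0))
    return float(np.sum(widths * np.asarray(moments)))

log_w = rng.normal(loc=-3.0, scale=2.0, size=4096)  # fabricated log-ratios
for K in (1, 2, 5, 20):
    betas = np.linspace(0.0, 1.0, K, endpoint=False)
    print(f"{K:2d} partition(s): TVO lower bound = {tvo_lower_bound(log_w, betas):.3f}")
# K = 1 reduces to the standard (Monte Carlo) ELBO; finer grids tighten the bound.
```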
4. Conditioning Structures and Approximation Fidelity
The structure of the variational posterior is critical in sequential models:
- Fully Conditioned Smoothing: The true posterior $p(z_t \mid x_{1:T})$ depends on all of the data. Full conditioning, $q_\phi(z_t \mid x_{1:T})$, enables learning of a generative model that matches maximum likelihood, with improved ELBO and predictive accuracy (Bayer et al., 2021).
- Partial (Filtered) Conditioning: Conditioning only on the past, $q_\phi(z_t \mid x_{1:t})$, as in filtering, creates a "conditioning gap": the optimal amortized $q_\phi$ does not represent any actual mixture of posterior marginals but approximates a product-of-experts, leading to overconfident or misshapen posteriors. This can negatively affect downstream multi-step predictions (Bayer et al., 2021). A toy comparison of filtering- versus smoothing-conditioned posteriors is sketched after this list.
- Experimentally: Models using full conditioning outperform partially conditioned ones in generative metrics and multi-step predictive performance on tasks like traffic flow, video, and vehicle trajectory prediction.
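To illustrate why the conditioning set matters, the sketch below uses exact Gaussian inference in a toy one-dimensional linear-Gaussian state-space model: the Kalman filter gives the best any filtering-conditioned posterior $q(z_t \mid x_{1:t})$ could represent, while the RTS smoother gives the fully conditioned target $p(z_t \mid x_{1:T})$. The model and its parameters are illustrative assumptions, not drawn from the cited experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D linear-Gaussian SSM:
#   z_1 ~ N(0, q),  z_t ~ N(a * z_{t-1}, q),  x_t ~ N(z_t, r)
a, q, r, T = 0.9, 0.5, 1.0, 50
z_true, x = np.zeros(T), np.zeros(T)
for t in range(T):
    z_true[t] = a * (z_true[t - 1] if t > 0 else 0.0) + np.sqrt(q) * rng.standard_normal()
    x[t] = z_true[t] + np.sqrt(r) * rng.standard_normal()

# Kalman filter: exact p(z_t | x_{1:t}), the best a filtering-conditioned
# q(z_t | x_{1:t}) could do.
fm, fv = np.zeros(T), np.zeros(T)
for t in range(T):
    pm, pv = (0.0, q) if t == 0 else (a * fm[t - 1], a * a * fv[t - 1] + q)
    k = pv / (pv + r)                                   # Kalman gain
    fm[t], fv[t] = pm + k * (x[t] - pm), (1.0 - k) * pv

# RTS smoother: exact p(z_t | x_{1:T}), the target of a fully conditioned
# q(z_t | x_{1:T}).
sm, sv = fm.copy(), fv.copy()
for t in range(T - 2, -1, -1):
    pv = a * a * fv[t] + q
    g = a * fv[t] / pv
    sm[t] = fm[t] + g * (sm[t + 1] - a * fm[t])
    sv[t] = fv[t] + g * g * (sv[t + 1] - pv)

print("mean |filtered - smoothed| posterior-mean gap:", np.abs(fm - sm).mean())
print("mean filtered variance:", fv.mean(), " mean smoothed variance:", sv.mean())
```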
5. Applications Across Domains
Sequential variational lower bounds are central to a variety of applications:
| Application Area | Reference(s) | Key Attribute or Result |
|---|---|---|
| Sequential VAEs / VSMC / FIVO | Maddison et al., 2017; Naesseth et al., 2017; Nierop et al., 10 Jan 2025 | Improved log-likelihoods (outperforming ELBO and IWAE) |
| Robust VAEs (rVAE) | Figurnov et al., 2016 | High noise immunity; downweighted outlier sequence impact |
| Deep Markov Models | Naesseth et al., 2017 | Superior posterior approximation and model fidelity |
| Bayesian Phylogenetics | Moretti et al., 2021 | Tighter ELBO via nested SMC; efficient exploration of the space |
| Experimental Design | Shen et al., 2023 | Actor-critic RL on a variational sequential lower bound |
| Constrained Bayesian Optimization | Takeno et al., 2021 | Robust MI-based lower bound with MC concentration guarantees |
These objectives have enabled new regularizers, improved optimization landscapes, and flexible adaptation to mini-batch, streaming, and reinforcement learning scenarios.
6. Algorithmic and Theoretical Properties
- Gradient Computation: Sequential bounds often allow for gradient estimation via reparameterization tricks or specialized estimators, such as doubly reparameterized TVO gradients (Brekelmans et al., 2020) and greedy actor-critic updates in sequential experimental design (Shen et al., 2023).
- Variance and Tightness: Increasing the number of SMC particles or Riemann-sum partitions systematically tightens the lower bound, which approaches the true log-evidence arbitrarily closely in the limit (Naesseth et al., 2017, Masrani et al., 2019, Struski et al., 2022).
- Adaptive Scheduling: For TVOs, adaptive selection of the intermediate schedule (chosen so as to equipartition moment parameters along the exponential-family path) minimizes the discretization error, closely matching optimal static grid-search performance (Brekelmans et al., 2020); a minimal equal-moment scheduling sketch follows this list.
- Generalization: Bounds can be unified and analyzed via Bregman divergences and Taylor remainders, clarifying the sources of gap between bound and true evidence (Brekelmans et al., 2020).
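A minimal sketch of such an equal-moment (moment-equipartitioning) schedule, under the assumption that the per-$\beta$ moments are estimated by self-normalized importance sampling from fabricated log-ratio samples; the cited works may construct the schedule differently in detail.

```python
import numpy as np

rng = np.random.default_rng(0)

def moment(log_w, beta):
    """Self-normalized estimate of E_{pi_beta}[log w] along the geometric path."""
    sw = np.exp(beta * log_w - np.max(beta * log_w))
    sw /= sw.sum()
    return float(np.sum(sw * log_w))

def equal_moment_schedule(log_w, n_points, grid_size=1001):
    """Choose interior beta points so that E_{pi_beta}[log w] is roughly
    equally spaced between its values at beta = 0 and beta = 1."""
    grid = np.linspace(0.0, 1.0, grid_size)
    moments = np.array([moment(log_w, b) for b in grid])
    targets = np.linspace(moments[0], moments[-1], n_points + 2)[1:-1]
    # The moment curve is nondecreasing in beta, so invert it by interpolation.
    return np.interp(targets, moments, grid)

log_w = rng.normal(loc=-3.0, scale=2.0, size=4096)  # fabricated log-ratios
print("equal-moment beta schedule:", np.round(equal_moment_schedule(log_w, 4), 3))
# The adaptive points cluster where E_{pi_beta}[log w] changes fastest,
# which is where a uniform grid would incur the largest Riemann-sum error.
```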
7. Empirical Findings and Impact
Empirical studies consistently show that:
- Sequential variational lower bounds, such as FIVO and VSMC, yield log-likelihood estimates up to 1 nat/time step better than standard ELBO/IWAE, with superior training stability and convergence rates on sequence data (polyphonic music, TIMIT speech, deep neural population recordings) (Maddison et al., 2017, Naesseth et al., 2017, Nierop et al., 10 Jan 2025).
- Robustified objectives outperform standard VAEs when large fractions of training data are noise, as reflected in rVAE's ability to maintain high test likelihood even when uninformative noise dominates the dataset (Figurnov et al., 2016).
- The more expressive and flexible the variational family (e.g., neural proposal parameterization, probabilistic circuits for discrete models), the tighter the bound, with improved posterior fidelity and generalization (Shih et al., 2020, Nierop et al., 10 Jan 2025).
- Conditioning on all available observations eliminates the conditioning gap, yielding more accurate generative and predictive performance (Bayer et al., 2021).
The theoretical and practical advances in sequential variational lower bounds underpin recent progress in flexible, accurate, and robust probabilistic modeling of sequential data, state estimation, experimental design, and large-scale inference in structured probabilistic models.