VideoVAE Auto-Encoder for Deep Video Compression

Updated 23 November 2025
  • The paper introduces a VideoVAE auto-encoder that jointly learns spatial and temporal dependencies using distinct global and local latent codes for superior rate-distortion tradeoffs.
  • It employs a sequential VAE model with an LSTM-based autoregressive prior and uniform-noise quantization, enabling end-to-end neural video compression.
  • Empirical results on diverse datasets demonstrate that the approach achieves competitive performance at lower bitrates compared to traditional block-based codecs.

VideoVAE auto-encoders are a class of end-to-end deep generative video codecs designed to compress temporal video sequences by jointly learning spatial and temporal structure within a sequential variational autoencoder (VAE) framework. The approach, exemplified by Deep Generative Video Compression (Han et al., 2018), departs from classical block-based motion coding paradigms by capturing both global static content and temporally local dynamics in a compositional latent space, enabling efficient neural video compression at competitive or superior rate-distortion tradeoffs.

1. Modeling Framework and Architecture

The architecture is organized around a sequential VAE that maps a short video segment $x_{1:T}$ into two classes of latent representations: a global code $\mathbf{f}$ representing holistic, temporally invariant features, and a sequence of local per-frame codes $z_{1:T}$ encoding dynamic, frame-specific variability. The encoding and decoding process is summarized as:

  • Encoder/Inference model: The variational approximation factorizes as

$$q_\phi(f, z_{1:T} \mid x_{1:T}) = q_\phi(f \mid x_{1:T}) \prod_{t=1}^{T} q_\phi(z_t \mid x_t)$$

where each posterior is a fixed-width uniform distribution centered at a neural network output: each frame $x_t$ is passed through a convolutional network, the resulting features are processed by a bidirectional LSTM and an MLP to produce $\hat f$, while a per-frame convolutional encoder and MLP produce each $\hat z_t$.

  • Decoder/Generative model: The generative process is defined as

$$p_\theta(f, z_{1:T}, x_{1:T}) = p_\theta(f)\, p_\theta(z_{1:T}) \prod_{t=1}^{T} p_\theta(x_t \mid z_t, f)$$

where the emission distribution is a factorized Laplace with mean determined by a deconvolutional decoder fusing $(z_t, f)$ via an MLP.

  • Temporal priors: The evolution of the sequence $z_{1:T}$ is modeled via (a) a deep Kalman filter with Markovian Gaussian transitions, or (b) an LSTM-based autoregressive prior, where the LSTM summarizes $z_{<t}$ and predicts Gaussian parameters for $z_t$.

Ablations confirm the importance of both the global code and an LSTM prior: a local-only variant (LSTMP-L) omits $\mathbf{f}$, and a variant with Kalman prior and global code (KFP-LG) replaces the LSTM with a simpler one-step Kalman form.
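
For concreteness, the following is a minimal PyTorch sketch of this two-level latent structure: a convolutional frame encoder, a bidirectional LSTM plus MLP producing the global code $\hat f$, a per-frame MLP producing the local codes $\hat z_t$, and a deconvolutional decoder that fuses $(z_t, f)$ to emit the Laplace mean of each frame. All layer widths and module names are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FrameEncoder(nn.Module):
    """Convolutional feature extractor applied to one 64x64 RGB frame."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(), # 16 -> 8
            nn.Flatten(),
        )
        self.fc = nn.Linear(256 * 8 * 8, feat_dim)

    def forward(self, x):                       # x: (B, 3, 64, 64)
        return self.fc(self.conv(x))            # (B, feat_dim)

class SequentialEncoder(nn.Module):
    """Infers the global code f (bi-LSTM over frame features + MLP) and the
    per-frame local codes z_t (per-frame MLP), as in the factorization above."""
    def __init__(self, feat_dim=512, f_dim=256, z_dim=64, hidden=512):
        super().__init__()
        self.frame_enc = FrameEncoder(feat_dim)
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.f_head = nn.Linear(2 * hidden, f_dim)   # global code f_hat
        self.z_head = nn.Linear(feat_dim, z_dim)     # local codes z_hat_t

    def forward(self, x):                       # x: (B, T, 3, 64, 64)
        B, T = x.shape[:2]
        feats = self.frame_enc(x.flatten(0, 1)).view(B, T, -1)
        h, _ = self.bilstm(feats)
        f_hat = self.f_head(h.mean(dim=1))      # summary of all frames
        z_hat = self.z_head(feats)              # conditioned on x_t only
        return f_hat, z_hat

class FrameDecoder(nn.Module):
    """Deconvolutional decoder: fuses (z_t, f) with an MLP and emits the
    Laplace mean for one frame."""
    def __init__(self, f_dim=256, z_dim=64):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(f_dim + z_dim, 256 * 8 * 8), nn.ReLU())
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),  # 8 -> 16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),   # 16 -> 32
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1),                # 32 -> 64
        )

    def forward(self, z_t, f):                  # (B, z_dim), (B, f_dim)
        h = self.fuse(torch.cat([z_t, f], dim=-1)).view(-1, 256, 8, 8)
        return self.deconv(h)                   # Laplace mean, (B, 3, 64, 64)
```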

2. Training Objective and Loss Formulation

Training is governed by a β-VAE (rate-distortion trade-off) objective:

  • Distortion: $D = -\mathbb{E}_{q_\phi}[\log p_\theta(x_{1:T} \mid f, z_{1:T})]$, the expected negative log-likelihood under the Laplace frame emission, corresponding to an expected $\ell_1$ reconstruction loss.
  • Rate: $R = -\mathbb{E}_{q_\phi}[\log p_\theta(f, z_{1:T})]$, measuring the expected code length under the learned prior.

The total loss is

$$\min_{\phi,\theta}\; D + \beta R$$

or equivalently, maximization of the $\beta$-weighted ELBO

$$\mathcal{L}(\phi,\theta) = \mathbb{E}_{q_\phi}[\log p_\theta(x_{1:T} \mid f, z_{1:T})] + \beta\, \mathbb{E}_{q_\phi}[\log p_\theta(f, z_{1:T})],$$

where the entropy of the fixed-width uniform posterior is constant and therefore omitted. The parameter $\beta$ is swept to traverse the rate-distortion curve, enabling fine-tuned tradeoffs between compression rate and reconstruction fidelity.
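
Under the uniform-noise relaxation, this objective can be computed directly from the noisy latents and the decoder output. The sketch below assumes hypothetical callables `prior_logprob_f` and `prior_logprob_z` standing in for the flow-based global prior and the LSTM autoregressive prior; it is an illustration of the loss structure, not the paper's training code.

```python
import torch

def rd_loss(x, x_mean, f_noisy, z_noisy, prior_logprob_f, prior_logprob_z,
            beta=0.1, laplace_scale=1.0):
    # Distortion: negative Laplace log-likelihood of the frames, which is an
    # L1 reconstruction term up to scale and an additive constant.
    distortion = (x - x_mean).abs().sum() / laplace_scale

    # Rate: negative log-probability of the noise-perturbed latents under
    # the learned priors, i.e. their expected code length in nats.
    rate = -(prior_logprob_f(f_noisy) + prior_logprob_z(z_noisy))

    # beta trades rate against distortion along the R-D curve.
    return distortion + beta * rate
```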

3. Quantization and Entropy Coding

Compression proceeds by discretizing latent representations and encoding them into a bitstream:

  • Quantization: During inference and training, uniform noise of width 1 is added to each latent coordinate. For compression, latents are rounded to the nearest integer.

$$\bar f = \mathrm{round}(\hat f), \qquad \bar z_t = \mathrm{round}(\hat z_t)$$

  • Entropy coding: The prior $p_\theta(f)$ is modeled as a factorized, nonparametric density via invertible flows. Each $p_\theta(z_t^i \mid z_{<t})$ is a Gaussian convolved with a uniform distribution, with parameters produced by the LSTM. Discretized latents are encoded via arithmetic coding: at each step $t$, the coder subdivides the interval $[0,1)$ according to the predicted probability and encodes $\bar f, \bar z_{1:T}$ as a binary sequence. Decoding reverses this process, guaranteeing agreement between encoder and decoder.
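
A minimal sketch of the two quantization modes and of the Gaussian-convolved-with-uniform probability mass that drives the entropy coder is shown below. The Gaussian parameters would come from the LSTM prior; the arithmetic coder itself is omitted and could be any standard implementation driven by these per-symbol probabilities. Function names and defaults are assumptions.

```python
import torch
from torch.distributions import Normal

def quantize(latent, training=True):
    if training:
        # Additive uniform noise in [-0.5, 0.5) serves as a differentiable
        # surrogate for rounding during training.
        return latent + torch.rand_like(latent) - 0.5
    # At compression time: hard rounding to the nearest integer.
    return torch.round(latent)

def discretized_gaussian_pmf(z_bar, mu, sigma):
    """P(z = z_bar) for an integer symbol under N(mu, sigma) convolved with a
    width-1 uniform, i.e. the CDF difference over [z_bar - 0.5, z_bar + 0.5]."""
    dist = Normal(mu, sigma)
    return dist.cdf(z_bar + 0.5) - dist.cdf(z_bar - 0.5)

# An ideal arithmetic coder then spends -log2(pmf) bits per coordinate,
# summed over all latent dimensions and time steps.
```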

4. Temporal Modeling and Content Decomposition

The sequential prior $p_\theta(z_t \mid z_{<t})$ enables the model to directly learn evolution patterns of latent codes, in contrast to the block motion estimation heuristics used in classical codecs. Inference of $\hat z_t$ is conditioned only on $x_t$, while $\hat f$ summarizes all frames, aligning static or slowly varying content with the global code and dynamics with the local codes.

During generation, the LSTM prior recursively produces $h_t$, from which $z_t$ is sampled, and each frame is emitted via the Laplace decoder distribution. This structure captures both temporal correlation and innovation in the data sequence.
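
The rollout below sketches, under assumed dimensions, how such an LSTM prior can be unrolled: the cell summarizes the previously produced codes $z_{<t}$ into $h_t$ and predicts Gaussian parameters for $z_t$, which here are simply sampled (in actual decoding they would parameterize the entropy decoder instead).

```python
import torch
import torch.nn as nn

class LSTMPrior(nn.Module):
    """Autoregressive prior over the local codes z_1:T (illustrative sizes)."""
    def __init__(self, z_dim=64, hidden=512):
        super().__init__()
        self.lstm = nn.LSTMCell(z_dim, hidden)
        self.head = nn.Linear(hidden, 2 * z_dim)   # predicts (mu_t, log_sigma_t)
        self.z_dim = z_dim

    def step(self, z_prev, state):
        h, c = self.lstm(z_prev, state)
        mu, log_sigma = self.head(h).chunk(2, dim=-1)
        return mu, log_sigma.exp(), (h, c)

    @torch.no_grad()
    def rollout(self, batch_size, T):
        z_prev = torch.zeros(batch_size, self.z_dim)
        state, zs = None, []
        for _ in range(T):
            mu, sigma, state = self.step(z_prev, state)
            z_t = mu + sigma * torch.randn_like(sigma)  # sample z_t from the prior
            zs.append(z_t)
            z_prev = z_t
        return torch.stack(zs, dim=1)                   # (B, T, z_dim)
```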

5. Implementation and Experimental Regime

Experiments are conducted on $64\times64$ RGB video segments ($T = 10$ frames), sampled at 4:4:4 with no prefiltering. Benchmarked datasets include:

  • Sprites: synthetic, low-dimensional
  • BAIR Robot Push: constrained robotic video
  • Kinetics600: diverse, downsampled YouTube clips

Both encoder and decoder employ five-stage convolutional networks (4×4 kernels, strides 2→1, progressive padding, channels 192→3072), with latent and LSTM dimensions adjusted by dataset: e.g., (64, 512, 1024) for Sprites, (256, 2048, 3072) for BAIR/Kinetics. The global prior utilizes a nonparametric flow-based factorized density, while the local prior is Gaussian-uniform. Training is performed end-to-end with Adam, varying $\beta$ to explore bit-rate trade-offs.
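
One plausible reading of this encoder configuration is sketched below: five convolutional stages with 4×4 kernels, stride 2 in the early stages and stride 1 at the end, and channel widths growing from 192 to 3072. The intermediate widths and padding schedule are assumptions made only to obtain a runnable example and may differ from the paper's exact settings.

```python
import torch
import torch.nn as nn

# Assumed five-stage trunk for a 64x64 RGB input; channel widths double
# from 192 up to 3072, spatial size shrinks from 64x64 to 1x1.
encoder_trunk = nn.Sequential(
    nn.Conv2d(3,    192,  4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
    nn.Conv2d(192,  384,  4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
    nn.Conv2d(384,  768,  4, stride=2, padding=1), nn.ReLU(),   # 16 -> 8
    nn.Conv2d(768,  1536, 4, stride=2, padding=1), nn.ReLU(),   # 8  -> 4
    nn.Conv2d(1536, 3072, 4, stride=1, padding=0),              # 4  -> 1
)

x = torch.randn(1, 3, 64, 64)
print(encoder_trunk(x).shape)   # torch.Size([1, 3072, 1, 1])
```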

6. Empirical Evaluation and Rate-Distortion Performance

Rate-distortion curves (PSNR vs bpp) on all datasets reveal:

  • Specialized domains (Sprites, BAIR): Best model (LSTMP-LG) achieves ≥40 dB at 0.05 bpp, while VP9/H.265 attain 20–25 dB at 0.5 bpp, a ~10-fold reduction in bitrate due to a tightly fit prior.
  • General domain (Kinetics): LSTMP-LG is competitive with state-of-the-art codecs in the 0.05–0.3 bpp range, despite the reduced spatial resolution.
  • Ablations: Transitioning from Kalman to LSTM priors lowers bitrates by 10–20%. Removing the global code reduces performance, highlighting the utility of disentangling static and dynamic components.

Supplementary results include MS-SSIM metrics, analysis of bit-rate allocation between global and local codes, and comparisons across sequence lengths, with VideoVAE maintaining competitiveness as H.265 segment durations increase.
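
For reference, the two axes of these rate-distortion curves, PSNR in dB and bits per pixel, can be computed as in the generic sketch below (not the paper's evaluation code).

```python
import numpy as np

def psnr(x, x_hat, max_val=255.0):
    # Peak signal-to-noise ratio between reference and reconstructed frames.
    mse = np.mean((x.astype(np.float64) - x_hat.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def bits_per_pixel(total_bits, num_frames, height, width):
    # Bitstream length normalized by the number of coded pixels.
    return total_bits / (num_frames * height * width)
```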

7. Insights, Limitations, and Prospective Developments

The model demonstrates salient strengths on domain-specific video with regular structure, efficiently capturing both spatial and temporal dependencies. The VAE-based compression avoids blocking and ringing artifacts typical of block-based codecs, though at very aggressive bitrates, reconstructions are subject to mild blurring—a known characteristic of VAE decoders.

High-frequency textures and rapid motion in previously unseen contexts can exceed the model's expressivity, leading to blurring or generated artifacts. Additionally, inference of the global code constrains the feasible latent dimensionality and complicates scaling to resolutions beyond $64\times64$ due to GPU memory constraints. Proposed future directions include fully convolutional or patchwise encoders to scale to high-resolution video, richer temporal priors (e.g., normalizing-flow RNNs), hybrid deterministic-stochastic architectures, learned side information analogous to motion vectors, and adversarial or perceptual loss integration for enhancing sample sharpness.

Overall, VideoVAE auto-encoders establish sequential VAEs, uniform-noise quantization, and neural entropy modeling as a viable, end-to-end paradigm for learned video compression, with demonstrated advantages on structured content and competitive performance on general datasets at low resolutions (Han et al., 2018).

References

  • Han et al., "Deep Generative Video Compression," 2018.
