Flow-based & Recurrent MDNs (FRMDN)
- Flow-based and Recurrent MDNs (FRMDNs) are sequence models that combine recurrent mixture density networks with normalizing flows to capture complex, multimodal data distributions.
- They parameterize predictive densities by transforming outputs via invertible affine-coupling layers, enabling efficient computation of Jacobian determinants.
- Empirical results demonstrate that FRMDNs outperform traditional RMDNs in image and audio tasks by providing state-of-the-art sequence likelihoods.
Flow-based and Recurrent Mixture Density Networks (FRMDN) represent a class of sequence models that augment classical recurrent mixture density networks (RMDNs) by integrating invertible transformations via normalizing flows into their probabilistic output heads. Through this hybridization, FRMDNs achieve highly expressive, tractable sequence likelihoods and offer state-of-the-art performance on diverse generative modeling benchmarks, notably in vision and audio settings. Closely related architectures, such as Recurrent Flow Networks (RFN), further expand this paradigm by disentangling spatial and temporal uncertainty through the joint parameterization of RNN hidden states, stochastic latent variables, and flow-based emission distributions.
1. Foundations: Recurrent Mixture Density Networks and Flow-based Extensions
The core of an RMDN consists of an RNN (typically an LSTM) that processes a sequence of inputs and previous targets to produce a hidden state vector $h_t$ at each time step. From $h_t$, the parameters of a $K$-component Gaussian mixture model (GMM) are generated: mixture weights $\pi_k$, means $\mu_k$, and covariance structures $\Sigma_k$. These directly model the predictive density at each step. While this architecture is expressive, its ability to capture complex, nonlinear, and multimodal targets remains fundamentally limited by the parametric form of the GMM.
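FRMDNs address this expressivity bottleneck by modeling the density not directly on the target $y_t$ but rather on a nonlinearly transformed version $u_t = f(y_t)$, where $f = f_L \circ \cdots \circ f_1$ is a composition of invertible affine-coupling transformations, i.e., a normalizing flow of RealNVP type. The output density under the change of variables is

$$p(y_t \mid h_t) = p_{\mathrm{GMM}}\big(f(y_t) \mid h_t\big) \left| \det \frac{\partial f(y_t)}{\partial y_t} \right|, \qquad \left| \det \frac{\partial f(y_t)}{\partial y_t} \right| = \prod_{\ell=1}^{L} \left| \det J_{f_\ell} \right|,$$

where $p_{\mathrm{GMM}}$ is the standard RNN-GMM density, $J_{f_\ell}$ is the Jacobian of the $\ell$-th coupling layer evaluated at its input, and the Jacobian factors are efficiently computable due to the flow's triangular structure (Razavi et al., 2020).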
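The change-of-variables rule can be checked numerically in one dimension. The sketch below is only an illustration: it uses `arcsinh` as a fixed stand-in for the learned flow (an assumption, not the coupling flow used in FRMDN), places a two-component GMM on the transformed variable, and integrates the resulting density over the target to confirm that it is normalized. All names and parameter values are hypothetical.

```python
import numpy as np

def gmm_pdf(u, weights, means, stds):
    """Density of a 1-D Gaussian mixture evaluated at each entry of u."""
    u = np.asarray(u)[..., None]
    comp = np.exp(-0.5 * ((u - means) / stds) ** 2) / (stds * np.sqrt(2 * np.pi))
    return comp @ weights

def flow_density(y, weights, means, stds):
    """p(y) = p_GMM(f(y)) * |f'(y)| with the stand-in flow f = arcsinh."""
    u = np.arcsinh(y)
    abs_jac = 1.0 / np.sqrt(1.0 + y ** 2)  # |d arcsinh(y) / dy|
    return gmm_pdf(u, weights, means, stds) * abs_jac

# Hypothetical mixture parameters on the transformed variable u.
weights = np.array([0.4, 0.6])
means = np.array([-1.0, 1.5])
stds = np.array([0.5, 0.3])

# Trapezoidal integration of p(y) over a wide grid in the target space.
y = np.linspace(-50.0, 50.0, 200001)
p = flow_density(y, weights, means, stds)
total = np.sum(0.5 * (p[1:] + p[:-1]) * np.diff(y))
```

Dropping the Jacobian factor `abs_jac` would leave a function that no longer integrates to one, which is why the determinant term is part of the likelihood.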
2. Model Architecture: Parameterization and Flow Construction
At each time step, the RNN yields a fixed-dimensional vector $h_t$. A final linear projection of $h_t$ is split into four blocks to parameterize the GMM:
- Mixture logits, mapped to weights $\pi_k$ via softmax.
- Component means $\mu_k$.
- Diagonal precision factors, exponentiated for positive definiteness.
- Low-rank factors.
Covariance matrices are parameterized from these diagonal and low-rank factors, with a small rank $r$ in practice, balancing expressivity and computational tractability.
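The four-way split described above can be sketched as follows. This is a minimal illustration under stated assumptions: the block layout, the function name `split_gmm_head`, and the specific form `precision_k = diag(d_k) + U_k U_k^T` are all hypothetical choices for a diagonal-plus-low-rank parameterization, not necessarily the exact form used by the authors.

```python
import numpy as np

def split_gmm_head(proj, K, D, r):
    """Split a flat projection vector into GMM parameter blocks.

    Assumed layout: K logits | K*D means | K*D log-diagonals | K*D*r low-rank entries.
    """
    sizes = [K, K * D, K * D, K * D * r]
    if proj.shape != (sum(sizes),):
        raise ValueError("projection has the wrong length")
    logits, mu, log_d, u = np.split(proj, np.cumsum(sizes)[:-1])

    logits = logits - logits.max()            # shift for a numerically stable softmax
    pi = np.exp(logits) / np.exp(logits).sum()

    mu = mu.reshape(K, D)
    d = np.exp(log_d).reshape(K, D)           # exponentiation keeps the diagonal positive
    U = u.reshape(K, D, r)

    # Assumed form: diag(d_k) + U_k U_k^T, positive definite for any U_k.
    precision = np.stack([np.diag(d[k]) + U[k] @ U[k].T for k in range(K)])
    return pi, mu, precision

# Usage with random numbers standing in for the RNN projection.
rng = np.random.default_rng(0)
K, D, r = 3, 4, 2
proj = rng.normal(size=K + 2 * K * D + K * D * r)
pi, mu, precision = split_gmm_head(proj, K, D, r)
```

Because the diagonal is strictly positive and `U_k U_k^T` is positive semidefinite, every matrix built this way is a valid precision matrix regardless of the raw network outputs.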
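The normalizing flow is composed of $L$ affine-coupling layers, each splitting the $D$-dimensional variable into two parts and applying an invertible transformation:
- For each block $\ell = 1, \dots, L$, with input $v = (v_{1:d}, v_{d+1:D})$ and output $v'$:

$$v'_{1:d} = v_{1:d}, \qquad v'_{d+1:D} = v_{d+1:D} \odot \exp\big(s_\ell(v_{1:d})\big) + t_\ell(v_{1:d}),$$

where $s_\ell$ and $t_\ell$ are neural networks and the Jacobian is lower triangular, yielding efficient determinant computation: $\log \left| \det J_{f_\ell} \right| = \sum_j s_\ell(v_{1:d})_j$.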
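A single RealNVP-style affine-coupling layer can be sketched in a few lines. The class below is an illustrative assumption, not the authors' implementation: the scale and translation networks are tiny random one-hidden-layer MLPs with Tanh activations, and the names `AffineCoupling`, `W1`, `Ws`, and `Wt` are hypothetical.

```python
import numpy as np

class AffineCoupling:
    """One affine-coupling layer: the first half passes through unchanged,
    the second half is scaled and shifted by functions of the first half."""

    def __init__(self, dim, hidden, rng):
        self.d = dim // 2
        out = dim - self.d
        self.W1 = rng.normal(scale=0.1, size=(self.d, hidden))
        self.Ws = rng.normal(scale=0.1, size=(hidden, out))
        self.Wt = rng.normal(scale=0.1, size=(hidden, out))

    def _nets(self, a):
        h = np.tanh(a @ self.W1)
        s = np.tanh(h @ self.Ws)   # bounded log-scale for numerical stability
        t = h @ self.Wt            # translation
        return s, t

    def forward(self, y):
        a, b = y[..., :self.d], y[..., self.d:]
        s, t = self._nets(a)
        u = np.concatenate([a, b * np.exp(s) + t], axis=-1)
        return u, s.sum(axis=-1)   # log|det J| is the sum of log-scales

    def inverse(self, u):
        a, b = u[..., :self.d], u[..., self.d:]
        s, t = self._nets(a)       # a is unchanged, so s and t can be recomputed
        return np.concatenate([a, (b - t) * np.exp(-s)], axis=-1)

rng = np.random.default_rng(0)
layer = AffineCoupling(dim=4, hidden=8, rng=rng)
y = rng.normal(size=(5, 4))
u, logdet = layer.forward(y)
y_rec = layer.inverse(u)
```

The inverse never has to invert the networks themselves, which is what makes both likelihood evaluation and sampling cheap for this family of flows.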
3. Training, Inference, and Implementation Details
FRMDNs are trained via direct maximization of the observed-sequence log-likelihood:

$$\log p(y_{1:T}) = \sum_{t=1}^{T} \left[ \log p_{\mathrm{GMM}}\big(f(y_t) \mid h_t\big) + \log \left| \det \frac{\partial f(y_t)}{\partial y_t} \right| \right],$$

with gradients efficiently propagated through both the RNN-GMM and flow layers due to the closed-form Jacobian. Optimization uses Adam or RMSProp with learning-rate scheduling, and covariance entries are clipped for stability during image-sequence modeling (Razavi et al., 2020).
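The per-sequence objective can be sketched as a sum of mixture log-densities plus flow log-determinants. The code below is a simplified illustration under assumptions: it uses diagonal covariances only, takes the transformed targets and log-determinants as precomputed inputs, and the function names are hypothetical.

```python
import numpy as np

def gmm_logpdf_diag(u, log_pi, mu, var):
    """log sum_k pi_k N(u; mu_k, diag(var_k)) for one time step.

    Shapes: u (D,), log_pi (K,), mu and var (K, D).
    """
    log_comp = -0.5 * np.sum((u - mu) ** 2 / var + np.log(2 * np.pi * var), axis=-1)
    a = log_pi + log_comp
    m = a.max()                               # log-sum-exp for numerical stability
    return m + np.log(np.exp(a - m).sum())

def sequence_nll(u_seq, logdet_seq, log_pi_seq, mu_seq, var_seq):
    """Negative log-likelihood of a sequence: -(sum_t log p_GMM(u_t) + log|det J_t|)."""
    total = 0.0
    for u, ld, lp, mu, var in zip(u_seq, logdet_seq, log_pi_seq, mu_seq, var_seq):
        total += gmm_logpdf_diag(u, lp, mu, var) + ld
    return -total

# Toy check: T = 3 steps, D = 2, one standard-normal component, identity flow.
T, D = 3, 2
u_seq = np.zeros((T, D))
logdet_seq = np.zeros(T)
log_pi_seq = np.zeros((T, 1))
mu_seq = np.zeros((T, 1, D))
var_seq = np.ones((T, 1, D))
nll = sequence_nll(u_seq, logdet_seq, log_pi_seq, mu_seq, var_seq)
```

In a real training loop the same quantity would be computed in an automatic-differentiation framework so that gradients reach both the recurrent network and the flow parameters.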
Inference proceeds by:
- Computing mixture parameters from $h_t$.
- Sampling $u_t$ from the Gaussian mixture.
- Applying the inverse flow $f^{-1}$ to obtain $y_t$.
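The three inference steps above can be sketched as ancestral sampling followed by flow inversion. This is a minimal illustration: the mixture parameters are fixed by hand rather than produced by an RNN, and `np.sinh` stands in for the inverse of a learned flow (it is the inverse of `arcsinh`). Both choices are assumptions made to keep the example self-contained, and the function name is hypothetical.

```python
import numpy as np

def sample_step(pi, mu, cov, inverse_flow, rng):
    """Draw one target: pick a component, sample the Gaussian, invert the flow."""
    k = rng.choice(len(pi), p=pi)                # component index ~ Categorical(pi)
    u = rng.multivariate_normal(mu[k], cov[k])   # sample in the transformed space
    y = inverse_flow(u)                          # map back to the target space
    return y, k

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [2.0, -1.0]])
cov = np.stack([0.1 * np.eye(2), 0.05 * np.eye(2)])

# Stand-in inverse flow: sinh, the inverse of the elementwise map arcsinh.
samples = np.array([sample_step(pi, mu, cov, np.sinh, rng)[0] for _ in range(1000)])
```

In a sequence model, each sampled target would be fed back into the RNN to produce the mixture parameters for the next step.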
The implemented RNN is a single-layer LSTM with 256 hidden units for image tasks, or several thousand units for speech. The flows use affine-coupling layers whose scale and translation networks are multilayer perceptrons with LeakyReLU or Tanh activations.
4. Generalizations: Stochastic Latent States and Alternative Flow Parameterizations
Recurrent Flow Networks (RFN) (Gammelli et al., 2020) represent a complementary direction by further introducing stochastic latent variables $z_t$ alongside deterministic RNN states $h_t$ in the recurrent backbone. The generative process factorizes over time as:

$$p(x_{1:T}, z_{1:T}) = \prod_{t=1}^{T} p(x_t \mid z_{\le t}, x_{<t})\, p(z_t \mid x_{<t}, z_{<t}).$$

The emission $p(x_t \mid z_{\le t}, x_{<t})$ leverages a conditional normalizing flow whose base distribution's parameters and coupling layers are functions of both $z_t$ and $h_{t-1}$. Variational inference is enabled with an amortized posterior $q(z_t \mid x_{\le t}, z_{<t})$ within a filtering-style ELBO objective.
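For orientation, a filtering-style ELBO for recurrent latent-variable models of this kind is commonly written as below. This is the generic form used for variational RNNs, offered as an assumption about the shape of the objective rather than the exact expression in the RFN paper; $x_t$ denotes the observation and $z_t$ the stochastic latent at step $t$.

$$\log p(x_{1:T}) \;\ge\; \mathbb{E}_{q(z_{1:T} \mid x_{1:T})} \left[ \sum_{t=1}^{T} \Big( \log p(x_t \mid z_{\le t}, x_{<t}) - \mathrm{KL}\big( q(z_t \mid x_{\le t}, z_{<t}) \,\big\|\, p(z_t \mid x_{<t}, z_{<t}) \big) \Big) \right]$$

With a flow-based emission, the reconstruction term $\log p(x_t \mid z_{\le t}, x_{<t})$ is itself evaluated through the change-of-variables formula, so it remains an exact log-density given the sampled latents.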
This architecture allows explicit modeling of both temporal uncertainty (via the latent $z_t$) and spatial complexity in emissions (via flow-based outputs). It permits separation and joint control of sequence dynamics and multimodal spatial distributions, as empirically validated on spatiotemporal mobility data (Gammelli et al., 2020).
5. Empirical Results and Comparative Performance
FRMDNs have demonstrated consistent and significant improvements over baseline RMDNs and other flow-based generative models across multiple domains:
- Image-sequence modeling (CarRacing, SuperMario latent spaces, ): For and low-rank , FRMDN achieves best reported test NLLs (CarRacing: 2.35 nats/time-step, SuperMario: 1.28 nats/time-step), outperforming RMDN variants (improvements up to 0.05 nats/time-step at low ).
- Speech generation (Blizzard, TIMIT, Accent datasets): With mixture components and diagonal precisions, FRMDN surpasses RMDN by thousands of nats (e.g., Blizzard: FRMDN vs. RMDN ).
- Single-image density (MNIST, CIFAR-10): On MNIST, FRMDN achieves test NLL below zero ( nats), outperforming masked autoregressive flows (MAF $1,092.3$ nats) and RMDN ($1,189.9$ nats); on CIFAR-10, FRMDN ( nats) exceeds MAF and RealNVP by 20–120 nats.
RFN demonstrates state-of-the-art log-likelihood on spatiotemporal urban mobility datasets such as NYC taxi and Copenhagen bike-share, outperforming conventional MDN-RNNs, VRNNs, and ConvLSTMs especially for long rollout horizons and in capturing intricate, non-Gaussian spatial features (Gammelli et al., 2020).
6. Theoretical and Practical Significance
The integration of normalizing flows within recurrent mixture density networks yields two principal benefits:
- Expressivity: By modeling the density on a flexible, learned nonlinear transformation of the targets, FRMDNs capture highly non-Gaussian, multimodal, and nonlinear structure beyond the direct GMM parameterization.
- Tractability: The use of affine-coupling flows (e.g., RealNVP) provides computationally efficient, analytically tractable likelihoods, enabling exact sequence-level log-likelihood evaluation and stable maximum likelihood training.
Low-rank plus diagonal covariance parameterization balances computational efficiency with the ability to model complex covariances, aiding scalability to high-dimensional targets, as in image sequences.
An implication is that FRMDN and related architectures facilitate both improved generative quality and reliable sequence likelihoods for downstream use in model-based reinforcement learning, audio waveform modeling, and empirical sequence density estimation.
7. Extensions, Limitations, and Future Directions
Extensions such as those in RFN (Gammelli et al., 2020) suggest opportunities to further separate sources of sequence variability by embedding stochastic latent states and leveraging conditional flows in the emission process. The consistent likelihood benefits reported for both vision and audio, as well as continuous spatial domains, indicate broad applicability.
Limitations include the computational cost associated with deep flows for large target dimensions, and the potential need for careful hyperparameter tuning (mixture count $K$, flow depth $L$, low-rank size $r$). A plausible implication is the growing importance of hybrid architectures that combine the sample efficiency and uncertainty modeling of RNN-MDNs with the density estimation capacity of normalizing flows.
Overall, FRMDNs and variants constitute a robust and general framework for high-capacity, tractable sequential density modeling, effectively bridging autoregressive and flow-based generative paradigms (Razavi et al., 2020, Gammelli et al., 2020).