Transformation Autoregressive Networks (TANs)
- Transformation Autoregressive Networks (TANs) are defined by combining invertible mappings with autoregressive conditional models to flexibly estimate complex data distributions.
- They merge techniques from normalizing flows and autoregressive models, enabling superior performance on synthetic benchmarks, image modeling, and anomaly detection.
- TANs leverage components like RNN-rescale and additive shift layers to facilitate efficient density learning and support applications in meta-learning and outlier detection.
Transformation Autoregressive Networks (TANs) are a class of tractable density estimators designed for general density estimation tasks. TANs integrate two previously competing methodological families: (1) the change-of-variables formula via a composition of invertible smooth transformations as in normalizing flows, and (2) autoregressive modeling of conditional densities, which factorizes the joint density into a product of one-dimensional conditionals. By learning both a powerful invertible mapping and an expressive autoregressive factorization, TANs achieve greater flexibility than either normalizing flow or autoregressive models in isolation. This framework supports density modeling across synthetic, structured, and real-world settings, and extends to applications such as anomaly detection, image modeling, and meta-learning for families of distributions (Oliva et al., 2018).
1. Mathematical Formulation
TANs model a density $p_X(x)$ over vectors $x \in \mathbb{R}^d$ via an invertible map $z = q(x)$ chosen so that $z$ is more amenable to autoregressive factorization. The change-of-variables formula gives:

$$p_X(x) = p_Z(q(x)) \left| \det \frac{\partial q(x)}{\partial x} \right|.$$

The transformed density is modeled autoregressively:

$$p_Z(z) = \prod_{i=1}^{d} p\left(z_i \mid z_1, \ldots, z_{i-1}\right).$$

The composite log-likelihood objective is:

$$\log p_X(x) = \sum_{i=1}^{d} \log p\left(z_i \mid z_{<i}\right) + \log \left| \det \frac{\partial q(x)}{\partial x} \right|, \qquad z = q(x),$$

where $\partial q(x)/\partial x$ is the Jacobian of the transformation. The map $q$ may itself be a composition of layers $q = q_L \circ \cdots \circ q_1$, with the total log-determinant being the sum of the per-layer log-determinants.
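To make the objective concrete, here is a minimal sketch in PyTorch, assuming a generic invertible `transform` that returns $(z, \log|\det|)$ and a `conditioner` that returns mixture-of-Gaussians parameters; the function and argument names are illustrative, not the paper's API.

```python
import torch

def tan_log_likelihood(x, transform, conditioner):
    """Composite TAN objective: log p(x) = sum_i log p(z_i | z_<i) + log|det dq/dx|.

    `transform` maps x -> (z, log_det), where q may be a stack of layers whose
    per-layer log-determinants have already been summed; `conditioner` returns
    mixture-of-Gaussians parameters for each conditional p(z_i | z_<i).
    Both interfaces are illustrative, not the paper's exact API.
    """
    z, log_det = transform(x)                     # z: (batch, d), log_det: (batch,)
    logits, means, log_scales = conditioner(z)    # each: (batch, d, n_components)
    comp = torch.distributions.Normal(means, log_scales.exp())
    # log-prob of each z_i under every mixture component, weighted by mixture weights
    log_probs = comp.log_prob(z.unsqueeze(-1)) + torch.log_softmax(logits, dim=-1)
    cond_ll = torch.logsumexp(log_probs, dim=-1).sum(dim=1)   # sum over dimensions
    return cond_ll + log_det                      # per-example log-likelihood
```

Maximizing the batch mean of this quantity recovers the training objective described in Section 3.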
2. Core Architectural Components
Autoregressive Conditional Models
After applying $q$, the conditional densities $p(z_i \mid z_{<i})$ are modeled as mixtures of Gaussians whose parameters are computed by neural networks. Two main architectures are proposed for summarizing $z_{<i}$ into the hidden state $h_i$:
- Linear Autoregressive Model (LAM): $h_i = W^{(i)} z_{<i} + b^{(i)}$, with a separate weight matrix $W^{(i)}$ per dimension $i$. This provides maximal capacity, at a parameter cost that grows quadratically with the input dimension.
- Recurrent Autoregressive Model (RAM): $h_i = f(z_{i-1}, h_{i-1})$, for a recurrent cell $f$ (e.g., GRU or LSTM), allowing parameter sharing and non-Markovian dependency modeling. The outputs of $f$ are passed through an MLP to produce the mixture parameters (sketched below).
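As an illustration of the RAM conditioner, the sketch below uses a GRU to summarize $z_{<i}$ and an MLP head to emit mixture parameters, with the 120-unit state and 40 mixture components listed in the hyperparameter section; the class and layer names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class RAMConditioner(nn.Module):
    """Recurrent autoregressive conditioner: a GRU summarizes z_<i into h_i, and an MLP
    maps h_i to mixture-of-Gaussians parameters for p(z_i | z_<i). A minimal sketch;
    layer sizes and interfaces are illustrative."""

    def __init__(self, hidden=120, n_components=40):
        super().__init__()
        self.gru = nn.GRU(input_size=1, hidden_size=hidden, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3 * n_components))

    def forward(self, z):
        batch, d = z.shape
        # Shift right so that h_i depends only on z_1..z_{i-1}; position 1 sees a zero input.
        zeros = torch.zeros(batch, 1, 1, device=z.device, dtype=z.dtype)
        inp = torch.cat([zeros, z[:, :-1].unsqueeze(-1)], dim=1)
        h, _ = self.gru(inp)                               # (batch, d, hidden)
        params = self.mlp(h)                               # (batch, d, 3 * n_components)
        logits, means, log_scales = params.chunk(3, dim=-1)
        return logits, means, log_scales
```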
Invertible Transformations
Several classes of are introduced:
- Linear (Global): $z = Ax + b$, with $A$ parameterized through a triangular decomposition (lower- and upper-triangular factors) so that the determinant and inverse remain tractable.
- Invertible RNN-Rescale: each output $z_i$ is produced by rescaling $x_i$, adding a shift computed from a recurrent state that summarizes $x_{<i}$, and applying a leaky-ReLU nonlinearity; all parameters are learnable. The resulting Jacobian is triangular, so the determinant is tractable, and inversion proceeds recursively over dimensions.
- Additive RNN Shift: $z_i = x_i + m(h_{i-1})$, with $h_i = f(x_i, h_{i-1})$, where $m$ is a small MLP and $f$ is an RNN cell. The Jacobian is unit-triangular, so the determinant is 1 (a code sketch follows below).
- Coupling Layers: additive coupling (NICE/Real NVP-style) splits the dimensions into two blocks and shifts one block by an MLP function of the other; this is a special case of the additive RNN shift.
- Stacking and Reversal: multiple transformation layers are stacked, and dimension reversals ($x \mapsto (x_d, \ldots, x_1)$) are interleaved between layers to enable bidirectional dependencies.
Empirically, 4–5 stacked layers combining linear, invertible RNN, and additive shift perform best for complex density estimation.
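A minimal sketch of one additive RNN-shift layer, following the definition above: the shift applied to $x_i$ depends only on the recurrent state built from $x_{<i}$, so the forward map contributes zero log-determinant and can be inverted dimension by dimension. Interfaces and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveRNNShift(nn.Module):
    """Additive RNN-shift transform: z_i = x_i + m(h_{i-1}), with h_i = f(x_i, h_{i-1})
    for an RNN cell f and a small MLP m. The Jacobian is unit-triangular, so the
    log-determinant is zero. A sketch, not the paper's code."""

    def __init__(self, hidden=16):
        super().__init__()
        self.cell = nn.GRUCell(1, hidden)
        self.shift = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.hidden = hidden

    def forward(self, x):
        batch, d = x.shape
        h = torch.zeros(batch, self.hidden, device=x.device)
        z, log_det = [], x.new_zeros(batch)
        for i in range(d):
            z.append(x[:, i:i + 1] + self.shift(h))   # shift depends only on x_<i
            h = self.cell(x[:, i:i + 1], h)           # update state with x_i
        return torch.cat(z, dim=1), log_det           # log|det| = 0

    def inverse(self, z):
        batch, d = z.shape
        h = torch.zeros(batch, self.hidden, device=z.device)
        xs = []
        for i in range(d):
            x_i = z[:, i:i + 1] - self.shift(h)        # recover x_i using the same shift
            h = self.cell(x_i, h)
            xs.append(x_i)
        return torch.cat(xs, dim=1)
```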
3. Training and Sampling Procedures
TANs are trained by maximizing the likelihood via gradient descent. Each training step passes the data forward through all transformation layers, sums the per-layer log-determinants, and evaluates the autoregressive model in the transformed space; the conditional likelihood is the sum of the per-dimension mixture log-probabilities. The Adam optimizer is used with a periodically decayed learning rate and gradient clipping at norm 1.
Sampling proceeds by ancestral sampling in the transformed ($z$) space using the autoregressive conditionals, followed by inversion through the sequence of transforms to obtain samples in the original ($x$) space. All operations are fully differentiable.
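A sketch of the sampling procedure, assuming the conditioner and transform interfaces from the earlier sketches: ancestral sampling draws each $z_i$ from its mixture conditional, and the transform stack is then inverted in reverse order to map back to $x$.

```python
import torch

def sample_tan(conditioner, transforms, n_samples, d):
    """Ancestral sampling in z-space followed by inversion through the transform stack.
    Assumes the RAMConditioner / AdditiveRNNShift interfaces sketched above; the paper's
    implementation may differ."""
    z = torch.zeros(n_samples, d)
    for i in range(d):
        # Positions > i are still zero, but the conditioner only looks at z_<i.
        logits, means, log_scales = conditioner(z)
        k = torch.distributions.Categorical(logits=logits[:, i]).sample()  # mixture index
        mean = means[:, i].gather(1, k.unsqueeze(1)).squeeze(1)
        scale = log_scales[:, i].gather(1, k.unsqueeze(1)).squeeze(1).exp()
        z[:, i] = mean + scale * torch.randn(n_samples)   # draw z_i | z_<i
    x = z
    for t in reversed(transforms):                         # undo q = q_L ∘ ... ∘ q_1
        x = t.inverse(x)
    return x
```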
4. Model Variants and Crossed Evaluations
TANs support systematic variation in both the conditional model and the transformation stack; the options listed in the two columns below are crossed in the evaluation:
| Conditional models evaluated | Transformation stacks evaluated |
|---|---|
| SingleInd (factorized Gaussians) | None; linear only |
| MultiInd (mixture, no conditioning) | 1× RNN-rescale; 2× RNN-rescale (with reversal) |
| Tied (NADE-style) | 4× additive coupling (NICE); 4× additive RNN-shift |
| LAM | Linear + RNN-rescale + 4× additive shift |
| RAM | Stacked combinations as above |
Empirical observations establish that pure autoregressive models struggle with highly non-Markovian or nonlinearly entangled data, while pure flows (e.g., NICE) with a factorized base density are ineffective when the transformed variables retain dependencies. The most capable models pair rich transformation stacks (linear + RNN-rescale + additive shift) with LAM or RAM conditioners, as in the composition sketched below.
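A small sketch of how a transformation stack can be composed, with per-layer log-determinants summed and dimension reversals interleaved between layers; the class names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class Reverse(nn.Module):
    """Reverse the dimension order; it is its own inverse and has zero log-determinant."""
    def forward(self, x):
        return x.flip(dims=[1]), x.new_zeros(x.shape[0])

    def inverse(self, z):
        return z.flip(dims=[1])


class TransformStack(nn.Module):
    """Compose several invertible layers; the total log-determinant is the sum over layers,
    and inversion applies the layer inverses in reverse order. A sketch only."""

    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)

    def forward(self, x):
        log_det = x.new_zeros(x.shape[0])
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
        return x, log_det

    def inverse(self, z):
        for layer in reversed(self.layers):
            z = layer.inverse(z)
        return z
```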
5. Hyperparameter Settings
- Conditionals: Mixtures of 40 Gaussians per conditional.
- Hidden size: 120 for LAM/RAM states; 16 for RNN transforms.
- Optimization: Adam with a periodically decayed learning rate (decay factor 0.1 or 0.5 every 5,000 steps), 30,000 steps total.
- Batch size: 256 (increased to 1024 for large UCI datasets).
- Gradient clipping: norm 1.
- Leaky-ReLU nonlinearities in the invertible RNN-rescale transforms.
- No normalization penalty is needed for additive shift layers (unit-triangular Jacobian, determinant 1).
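For reference, the settings above can be collected into a single configuration; the key names are ours, and the unspecified initial learning rate is intentionally omitted.

```python
# Hyperparameters from the list above, gathered into an illustrative config
# (key names are ours; the initial learning rate is not specified in this section).
TAN_CONFIG = {
    "n_mixture_components": 40,        # Gaussians per conditional
    "conditioner_hidden_size": 120,    # LAM / RAM state size
    "transform_rnn_hidden_size": 16,   # hidden size of RNN transforms
    "optimizer": "adam",
    "lr_decay_factors": (0.1, 0.5),    # one of these, applied every 5,000 steps
    "decay_every_steps": 5_000,
    "total_steps": 30_000,
    "batch_size": 256,                 # 1024 for the large UCI datasets
    "grad_clip_norm": 1.0,
}
```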
6. Empirical Validation and Applications
Synthetic Benchmarks
On Markovian sinusoids with random-walk noise, TANs using a RAM conditioner with four additive RNN-shift (SRNN) layers and reversals achieve the best log-likelihoods, outperforming methods that use only transformations or only autoregressive conditionals. On star-structured graphical models, TANs handle both the hub and the peripheral structure, where pure transformation or pure conditional approaches struggle.
Real-world Datasets
On the UCI datasets (POWER, GAS, HEPMASS, MINIBOONE, BSDS300, with dimensionality up to 63), TANs surpass MADE, Real NVP, and MAF models, e.g., achieving $0.60$ nats on POWER compared to $0.24$ nats for MAF(10).
Image Modeling
Continuous pixel modeling on MNIST ($d = 784$) and CIFAR-10 ($d = 3072$) after dequantization and a logit transform shows bits-per-pixel improvements over MAF and Real NVP. Generated samples exhibit high visual quality and digit-like appearance.
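Dequantization and the logit transform mentioned above are standard preprocessing steps for continuous pixel models; a minimal sketch follows, where the squeezing constant `alpha` is an assumption rather than the paper's reported value.

```python
import numpy as np

def dequantize_and_logit(images, alpha=0.05, rng=None):
    """Standard preprocessing for continuous pixel modeling: add uniform noise to the
    integer pixel values, squeeze away from {0, 1}, and apply a logit transform so the
    data is unbounded. alpha=0.05 is an assumption, not necessarily the paper's value."""
    rng = rng or np.random.default_rng()
    x = (images + rng.uniform(size=images.shape)) / 256.0   # dequantize to [0, 1)
    x = alpha + (1 - 2 * alpha) * x                          # squeeze away from the edges
    return np.log(x) - np.log(1 - x)                         # logit transform
```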
Anomaly Detection
TANs, when used for outlier detection (ODDS datasets: forest, pendigits, satimage2), achieve the highest average precision, implying that the learned densities capture semantically meaningful structure.
Learning Distribution Families
TANs are combined with DeepSets encoders to model a continuum of distributions. For sets of samples, each set drawn from a different member of a distribution family, a shared TAN is conditioned on a permutation-invariant embedding of the set (an encoder of this kind is sketched below). For ShapeNet point clouds (with object category as the family), the model generalizes to previously unseen objects.
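A minimal sketch of a permutation-invariant DeepSets encoder whose output could condition a shared TAN (for example, by concatenating the embedding to the conditioner and transform inputs); the class name and layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SetEmbedding(nn.Module):
    """Permutation-invariant DeepSets encoder: embed each set element with phi, pool by
    mean over the set, and map the pooled code with rho. The resulting embedding can
    condition a shared TAN. Sizes and names are illustrative, not the paper's."""

    def __init__(self, d_in, d_embed=64):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(d_in, d_embed), nn.ReLU(),
                                 nn.Linear(d_embed, d_embed))
        self.rho = nn.Sequential(nn.Linear(d_embed, d_embed), nn.ReLU(),
                                 nn.Linear(d_embed, d_embed))

    def forward(self, x_set):                  # x_set: (batch, set_size, d_in)
        return self.rho(self.phi(x_set).mean(dim=1))   # (batch, d_embed)
```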
7. Significance and Unified Perspective
TANs constitute a unified framework interpolating between normalizing flows and autoregressive models. By simultaneously learning invertible transformations (linear, RNN-rescale, additive shift) and rich autoregressive conditioners (LAM, RAM), TANs achieve state-of-the-art performance in density estimation, anomaly detection, image modeling, and meta-learning of distribution families. The empirical results demonstrate that jointly leveraging expressive transformations and conditional modeling provides superior flexibility and accuracy compared to using either alone (Oliva et al., 2018).