Normalizing Flows are Capable Generative Models (2412.06329v3)

Published 9 Dec 2024 in cs.CV and cs.LG

Abstract: Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at https://github.com/apple/ml-tarflow.

Summary

  • The paper introduces TarFlow, a novel architecture that integrates autoregressive Transformers with normalizing flows for superior likelihood estimation and sample quality.
  • It employs Gaussian noise augmentation, post-training denoising, and a guidance mechanism to significantly improve generalization and output fidelity.
  • Empirical results on benchmarks such as ImageNet 64x64 demonstrate TarFlow's competitive performance, narrowing the gap with diffusion models and GANs.

An Analysis of "Normalizing Flows are Capable Generative Models"

In this paper, the authors explore the potential of normalizing flows (NFs), a family of generative models that has received far less attention than state-of-the-art approaches such as diffusion models and LLMs. The paper introduces a novel architecture, TarFlow, designed to extend the capabilities of NFs by leveraging the strengths of autoregressive Transformers. The approach yields substantial improvements on both likelihood estimation and sample generation, positioning TarFlow as a serious contender in generative modeling.

The core contribution is the TarFlow model, a Transformer-based variant of Masked Autoregressive Flows (MAFs): a stack of autoregressive Transformer blocks operating on image patches, with the autoregression direction alternating between layers. This design permits straightforward end-to-end training and direct pixel-level modeling and generation. The architecture achieves state-of-the-art results in image likelihood estimation, surpassing previous methods by a significant margin, and, for the first time with a standalone NF model, attains sample quality and diversity on par with diffusion models.
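To make the design concrete, here is a minimal PyTorch sketch of the idea: each block is a causal Transformer that predicts an affine shift and log-scale for every patch token from strictly earlier tokens, and the token order is reversed between blocks to alternate the autoregression direction. All class and parameter names are hypothetical simplifications, not the official apple/ml-tarflow code.

```python
import torch
import torch.nn as nn

class CausalFlowBlock(nn.Module):
    """One MAF-style block: a causal Transformer predicts a per-token
    shift and log-scale from earlier tokens only (a hypothetical
    simplification, not the official TarFlow implementation)."""
    def __init__(self, dim, heads=4, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.out = nn.Linear(dim, 2 * dim)  # predicts (mu, log_sigma) per token

    def forward(self, x):  # x: (batch, tokens, dim) flattened image patches
        T = x.size(1)
        # Shift right so position t is conditioned only on tokens < t.
        ctx = torch.cat([torch.zeros_like(x[:, :1]), x[:, :-1]], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(x.device)
        h = self.encoder(ctx, mask=mask)
        mu, log_sigma = self.out(h).chunk(2, dim=-1)
        z = (x - mu) * torch.exp(-log_sigma)   # invertible affine transform
        log_det = -log_sigma.sum(dim=(1, 2))   # log |det dz/dx| per sample
        return z, log_det

class TarFlowSketch(nn.Module):
    """Stack of blocks; reversing the token order before each block after
    the first alternates the autoregression direction between layers."""
    def __init__(self, dim, n_blocks=8):
        super().__init__()
        self.blocks = nn.ModuleList(CausalFlowBlock(dim) for _ in range(n_blocks))

    def forward(self, x):
        total_log_det = torch.zeros(x.size(0), device=x.device)
        for i, block in enumerate(self.blocks):
            if i > 0:
                x = x.flip(1)  # a permutation: |det| = 1, no log-det term
            x, log_det = block(x)
            total_log_det = total_log_det + log_det
        return x, total_log_det
```

Training then maximizes the exact log-likelihood, i.e. the standard-Gaussian log-density of z plus total_log_det; sampling inverts each block token by token, which is sequential as in any MAF-style flow.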

Three further techniques improve TarFlow's generative output and sample quality (each is sketched in code after the list):

  1. Gaussian Noise Augmentation: Whereas prior flows typically add a small amount of uniform dequantization noise, the paper shows that adding a moderate amount of Gaussian noise during training improves generalization and supports higher-fidelity samples.
  2. Post-Training Denoising: An efficient score-based denoising step estimates the clean image underlying each generated sample, enhancing perceptual quality.
  3. Guidance Mechanism: Guidance during sample generation, shown here to be compatible with normalizing flows, significantly boosts output quality in both class-conditional and unconditional settings.
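A minimal sketch of the three techniques, assuming a model that returns (z, log_det) as in the sketch above and a standard Gaussian base distribution. The noise level SIGMA, the use of Tweedie's formula for denoising, and the CFG-style guidance rule are illustrative assumptions, not the paper's exact recipe:

```python
import torch

SIGMA = 0.05  # hypothetical training noise level; the paper tunes this

def augment(x, sigma=SIGMA):
    """Technique 1: train on y = x + sigma * eps (Gaussian noise)
    instead of the usual small uniform dequantization noise."""
    return x + sigma * torch.randn_like(x)

def denoise(model, y, sigma=SIGMA):
    """Technique 2: score-based denoising of generated samples, here
    realized via Tweedie's formula E[x | y] = y + sigma^2 * grad_y log p(y).
    A flow gives log p(y) exactly, so the score comes from autodiff.
    (One standard way to do this; the paper's procedure may differ.)"""
    y = y.detach().requires_grad_(True)
    z, log_det = model(y)
    # log p(y) up to a constant: Gaussian base density plus the log-det.
    log_p = -0.5 * z.pow(2).flatten(1).sum(dim=1) + log_det
    score = torch.autograd.grad(log_p.sum(), y)[0]
    return (y + sigma ** 2 * score).detach()

def guide(mu_cond, mu_uncond, w):
    """Technique 3: guidance applied to the per-token predictions during
    autoregressive sampling, in the style of classifier-free guidance;
    w = 0 recovers the plain conditional prediction."""
    return mu_cond + w * (mu_cond - mu_uncond)
```

As with classifier-free guidance in diffusion models, larger w typically trades sample diversity for fidelity; denoising is applied once per generated sample after sampling completes.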

Empirically, TarFlow performs strongly across several benchmarks: on ImageNet 64x64 likelihood estimation it achieves under 3 bits per dimension (BPD), a notable improvement over prior methods. In generation, its samples achieve competitive Fréchet Inception Distance (FID) scores on class-conditional ImageNet, approaching top-tier diffusion models and GANs.
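For reference, BPD is the model's negative log-likelihood converted from nats to bits and averaged over input dimensions; a small helper showing the standard conversion (not taken from the paper's code):

```python
import math

def bits_per_dim(nll_nats: float, num_dims: int) -> float:
    """Convert a per-image negative log-likelihood in nats to bits per
    dimension. For 64x64 RGB images, num_dims = 64 * 64 * 3 = 12288."""
    return nll_nats / (num_dims * math.log(2))
```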

From a practical standpoint, TarFlow's advances suggest viable pathways for deploying NFs in real-world applications, given its ability to scale and produce high-quality outputs. The paper demonstrates that, with careful architectural choices and training techniques, normalizing flows can rival more popular generative models, challenging preconceived limitations of the NF paradigm.

Looking forward, the implications of this research are manifold. The demonstrated integration of autoregressive Transformers and normalizing flows may inspire exploration in domains beyond image generation, such as audio processing, and inform the design of new architectures for emerging AI challenges. The scalability and modularity of TarFlow invite a reconsideration of NFs' potential, encouraging ongoing research into sample efficiency and alternative architectural choices.

This paper exemplifies a significant step in closing the performance gap between normalizing flows and other prominent generative models, providing a solid foundation for future exploration and utilization of NFs in the AI landscape.
