- The paper introduces STARFlow, which combines a deep-shallow transformer architecture with learning in the latent space of a pretrained autoencoder to scale normalizing flows to high-resolution image synthesis.
- It presents a novel guidance algorithm that improves sample quality in both class-conditioned and text-to-image generation.
- Empirical results demonstrate state-of-the-art performance among normalizing flows, with FID scores of 2.40 on class-conditional ImageNet 256×256 and 9.1 on zero-shot COCO 2017.
STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis
The paper, authored by Jiatao Gu et al., introduces STARFlow, a scalable approach to high-resolution image synthesis with normalizing flows. STARFlow builds on Transformer Autoregressive Flow (TARFlow), which combines normalizing flows with autoregressive transformer architectures, a pairing that has shown promising results in generative modeling.
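The core mechanic of such a flow block can be shown in a short, self-contained sketch. Everything below (the class name, layer sizes, the plain nn.TransformerEncoder backbone, the omission of positional embeddings) is a simplifying assumption for exposition rather than the authors' implementation: a causal transformer predicts a per-token shift and log-scale from the preceding tokens, giving a parallel forward pass with a closed-form log-determinant and a sequential inverse used for sampling.

```python
# Illustrative sketch of one TARFlow-style autoregressive affine flow block.
import torch
import torch.nn as nn

class ARFlowBlock(nn.Module):
    def __init__(self, dim: int, width: int = 256, layers: int = 2, heads: int = 4):
        super().__init__()
        self.inp = nn.Linear(dim, width)
        enc_layer = nn.TransformerEncoderLayer(width, heads, 4 * width, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, layers)
        self.out = nn.Linear(width, 2 * dim)  # per-token shift and log-scale

    def _params(self, x):
        # Causal conditioning: token t only sees tokens < t (inputs shifted right by one).
        B, T, D = x.shape
        h = self.inp(torch.cat([torch.zeros(B, 1, D, device=x.device), x[:, :-1]], dim=1))
        mask = torch.triu(torch.full((T, T), float("-inf"), device=x.device), diagonal=1)
        h = self.transformer(h, mask=mask)
        mu, log_sigma = self.out(h).chunk(2, dim=-1)
        return mu, log_sigma

    def forward(self, x):
        # Normalizing direction x -> z: all tokens in parallel, tractable log-det.
        mu, log_sigma = self._params(x)
        z = (x - mu) * torch.exp(-log_sigma)
        logdet = -log_sigma.sum(dim=(1, 2))
        return z, logdet

    @torch.no_grad()
    def inverse(self, z):
        # Generative direction z -> x: sequential, token t needs the decoded tokens < t.
        x = torch.zeros_like(z)
        for t in range(z.shape[1]):
            mu, log_sigma = self._params(x)
            x[:, t] = z[:, t] * torch.exp(log_sigma[:, t]) + mu[:, t]
        return x
```

The triangular Jacobian is what keeps the log-determinant cheap during training; sampling pays for it with a sequential decode, which is where the deep-shallow split discussed next matters.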
Core Innovations and Contributions
STARFlow is distinguished by architectural and algorithmic advances geared toward improving the scalability and performance of normalizing flows:
- Deep-Shallow Architecture Design: Most of the model capacity is allocated to deep transformer blocks, followed by computationally inexpensive shallow blocks. This allocation improves computational efficiency without sacrificing modeling power, since most parameters are concentrated in the stages closest to the prior distribution (this design and the latent-space choice below are illustrated in the first sketch after this list).
- Latent Space Learning: STARFlow departs from direct pixel modeling and instead leverages the latent space of pretrained autoencoders. This choice dramatically improves the generative quality of the model, especially for high-resolution inputs, as evidenced by empirical evaluations.
- Novel Guidance Algorithm: A new guidance method is introduced that markedly improves sample quality, particularly when high guidance weights are needed. It supports both class-conditioned and text-to-image generation (a generic, hedged guidance sketch follows the architecture sketch below).
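How the deep-shallow split and latent-space modeling might fit together is sketched below, reusing the illustrative ARFlowBlock from the earlier sketch. The block counts, widths, and the idea of flattening autoencoder latents into a token sequence are assumptions for exposition, not the paper's exact configuration; in practice the latents would come from a frozen, pretrained autoencoder rather than a placeholder tensor.

```python
# Hedged sketch: a deep-shallow stack of autoregressive flow blocks over
# autoencoder latents. Uses the illustrative ARFlowBlock defined earlier.
import torch
import torch.nn as nn

class DeepShallowFlow(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Cheap shallow blocks first (in the normalizing direction), then one
        # high-capacity block whose output is matched to the Gaussian prior --
        # i.e. most parameters sit in the stage closest to the prior.
        self.blocks = nn.ModuleList(
            [ARFlowBlock(dim, width=256, layers=1) for _ in range(3)]
            + [ARFlowBlock(dim, width=768, layers=12)]
        )

    def forward(self, x):
        # x: latent tokens from a frozen pretrained autoencoder,
        #    flattened to (batch, tokens, channels).
        logdet = 0.0
        for block in self.blocks:
            x, ld = block(x)
            x = torch.flip(x, dims=[1])  # reverse token order between blocks
            logdet = logdet + ld
        return x, logdet

    @torch.no_grad()
    def sample(self, z):
        # Invert the stack: the deep block is decoded first, then the shallow
        # ones; the result is passed to the autoencoder's decoder (not shown).
        for block in reversed(self.blocks):
            z = torch.flip(z, dims=[1])
            z = block.inverse(z)
        return z

def nll(model, latents):
    # Exact maximum-likelihood objective under a standard normal prior:
    # -log p(x) = 0.5 * ||z||^2 + const - log|det J|.
    z, logdet = model(latents)
    return (0.5 * (z ** 2).sum(dim=(1, 2)) - logdet).mean()
```

Because inverting the deep block dominates sampling cost, keeping the remaining blocks shallow is what makes this allocation computationally attractive.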
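The paper's guidance algorithm itself is not reproduced here. As a placeholder, the sketch below only shows where guidance would enter the sequential inverse pass, using the standard classifier-free-guidance extrapolation between conditional and unconditional predictions; the params_fn hook, the combination rule, and the weight w are all assumptions for illustration.

```python
# Placeholder sketch: CFG-style guidance inside the autoregressive inverse.
# This is NOT the paper's guidance rule, only a generic stand-in.
import torch

@torch.no_grad()
def guided_inverse(params_fn, z, cond, uncond, w: float = 2.0):
    # params_fn(x, c) is an assumed hook returning per-token (mu, log_sigma)
    # given previous tokens x and conditioning c (class label or text embedding).
    x = torch.zeros_like(z)
    for t in range(z.shape[1]):
        mu_c, ls_c = params_fn(x, cond)    # conditional prediction
        mu_u, ls_u = params_fn(x, uncond)  # unconditional prediction
        # Extrapolate the conditional prediction away from the unconditional
        # one with weight w (w = 1 recovers purely conditional sampling).
        mu = mu_u[:, t] + w * (mu_c[:, t] - mu_u[:, t])
        ls = ls_u[:, t] + w * (ls_c[:, t] - ls_u[:, t])
        x[:, t] = z[:, t] * torch.exp(ls) + mu
    return x
```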
Theoretical Contributions
The paper also offers a theoretical argument for the expressivity of autoregressive flows, establishing that stacks of multiple flow blocks are universal approximators of continuous distributions. The universality proposition (sketched in Section 1.4 of the paper) explains why stacked autoregressive flows are expressive enough to serve as a general-purpose generative modeling approach.
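For reference, the transformation each block applies can be written compactly; the notation below is generic rather than the paper's own symbols, but it makes clear why the Jacobian is triangular and why stacking blocks with permuted token orderings increases expressivity.

```latex
% One affine autoregressive flow block: z_t depends on x_t and on earlier
% tokens only, so the Jacobian is lower triangular.
z_t = \frac{x_t - \mu_\theta(x_{<t})}{\sigma_\theta(x_{<t})},
\qquad
\log\left|\det\frac{\partial z}{\partial x}\right| = -\sum_t \log \sigma_\theta(x_{<t}).

% Stacking K blocks f^{(1)},\dots,f^{(K)}, with the token order permuted or
% reversed between blocks, gives the exact log-likelihood
\log p_\theta(x) =
  \log \mathcal{N}\!\left(f^{(K)} \circ \cdots \circ f^{(1)}(x)\,;\,0, I\right)
  + \sum_{k=1}^{K} \log\left|\det J_{f^{(k)}}\right|.
```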
Empirical Results
The empirical evaluation demonstrates STARFlow's competitive performance across several image synthesis benchmarks. STARFlow achieves strong results in both class-conditioned and text-conditional image generation, rivalling state-of-the-art diffusion models while retaining the training efficiency and tractable likelihood of normalizing flows.
- ImageNet Evaluations: On the ImageNet 256×256 benchmark, STARFlow reports an FID of 2.40, a significant improvement over previous normalizing-flow models such as TARFlow.
- Text-to-Image Evaluations: On zero-shot COCO 2017 generation, STARFlow records an FID of 9.1, confirming its ability to generate high-quality images conditioned on textual descriptions (a generic sketch of how such FID numbers are computed follows below).
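For context on the metric, the sketch below shows a generic way FID numbers like these are computed with torchmetrics; it is not the paper's evaluation protocol (reference statistics, sample counts, and image preprocessing all affect the score), and the tensors shown are placeholders.

```python
# Generic FID computation sketch (not the paper's exact protocol).
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048, normalize=True)  # float images in [0, 1]

# Placeholders: in practice these would be real dataset images and model
# samples, both shaped (N, 3, H, W).
real_images = torch.rand(64, 3, 256, 256)
fake_images = torch.rand(64, 3, 256, 256)

fid.update(real_images, real=True)
fid.update(fake_images, real=False)
print(float(fid.compute()))  # lower is better
```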
Implications and Future Directions
The advances in STARFlow point to substantial potential for normalizing flows in scalable, high-resolution generative tasks. Future work could address the joint design of the latent space and the normalizing flow, which the paper highlights as a limitation, as well as faster inference and extensions beyond image generation to modalities such as video synthesis or 3D scene modeling. STARFlow thus offers a promising alternative to prevailing generative paradigms and opens pathways for a range of applications in AI-driven image synthesis.