
Fractal Generative Models (2502.17437v2)

Published 24 Feb 2025 in cs.LG and cs.CV

Abstract: Modularization is a cornerstone of computer science, abstracting complex functions into atomic building blocks. In this paper, we introduce a new level of modularization by abstracting generative models into atomic generative modules. Analogous to fractals in mathematics, our method constructs a new type of generative model by recursively invoking atomic generative modules, resulting in self-similar fractal architectures that we call fractal generative models. As a running example, we instantiate our fractal framework using autoregressive models as the atomic generative modules and examine it on the challenging task of pixel-by-pixel image generation, demonstrating strong performance in both likelihood estimation and generation quality. We hope this work could open a new paradigm in generative modeling and provide a fertile ground for future research. Code is available at https://github.com/LTH14/fractalgen.


Summary

  • The paper introduces a recursive fractal framework that composes simpler generative modules to exponentially increase output dimensionality while reducing computational costs.
  • It applies the approach to pixel-by-pixel image generation by partitioning high-resolution images into manageable patches that are sequentially processed.
  • Experiments on ImageNet demonstrate competitive results and scalability, offering actionable insights for recursive generative techniques in complex data domains.

The paper "Fractal Generative Models" (2502.17437) introduces a novel framework for constructing generative models by recursively composing simpler generative modules, drawing inspiration from the self-similar structure of mathematical fractals and natural patterns like biological neural networks and images. The core idea is to abstract a generative model itself as a modular unit and build a larger, more complex generative model by recursively invoking instances of this module. This recursive construction results in a fractal-like architecture with self-similarity across different levels of modules.

The motivation behind this approach is to efficiently model high-dimensional data with intrinsic structures that are not easily represented as one-dimensional sequences. Examples include images, molecular structures, and biological networks, which often exhibit multi-scale or hierarchical organization akin to fractals. By employing a recursive structure, the fractal framework can achieve an exponential increase in the dimensionality of generated outputs relative to the number of recursive levels, making it suitable for modeling very high-dimensional distributions using a manageable number of model levels.

The framework operates on a recursive rule analogous to a fractal generator. To model a set of N = k^n variables, the first level partitions the joint distribution into k conditional distributions, each over k^(n-1) variables. Each of these conditional distributions is then modeled by a second-level generator, and the process continues recursively for n levels. Each level i takes an output from the previous level i-1 and generates a set of outputs for level i+1. The "atomic generative modules" are the parametric models used at each recursive step (e.g., neural networks).
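The recursive rule above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `fractal_generate` and `toy_module` are hypothetical names, and the toy module is a deterministic stand-in for a learned neural generator.

```python
# Minimal sketch of the recursive fractal rule. A level-i atomic module maps
# one conditioning input to k child conditions, which seed k level-(i+1)
# modules; after n levels this yields k^n leaf variables.

def fractal_generate(condition, level, n, k, atomic_module):
    """Recursively generate k^(n - level) leaf variables."""
    if level == n:
        return [condition]  # base case: one generated variable
    # The atomic module partitions its scope into k conditional sub-problems.
    outputs = []
    for child in atomic_module(condition, level, k):
        outputs.extend(fractal_generate(child, level + 1, n, k, atomic_module))
    return outputs

def toy_module(condition, level, k):
    # Deterministic stand-in for a neural atomic generative module.
    return [condition * k + i for i in range(k)]

# With n = 3 levels and branching factor k = 4, we obtain 4^3 = 64 outputs.
leaves = fractal_generate(0, 0, 3, 4, toy_module)
print(len(leaves))  # 64
```

The exponential growth in output dimensionality with the number of levels falls directly out of the recursion: each added level multiplies the number of leaves by k.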

As a concrete instantiation, the paper applies this framework to the challenging task of pixel-by-pixel image generation using autoregressive models (specifically, Transformer variants) as the atomic generative modules. Pixel-by-pixel generation is difficult due to the high dimensionality of raw pixel space (e.g., 256x256x3 values) and the quadratic computational complexity of standard autoregressive models in sequence length. The fractal approach addresses this with a divide-and-conquer strategy. For a 256x256 image, the first level might model 16x16 patches (256 patches), the second level models 4x4 sub-patches within each 16x16 patch (16 sub-patches per patch), the third level models the individual pixels within each 4x4 sub-patch (16 pixels per sub-patch), and the final level models the 3 RGB channels within each pixel autoregressively.
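The partition arithmetic in that example can be checked directly. This is a small sanity check of the four-level decomposition described above (the level sizes follow the text, not the official codebase):

```python
# Per-level sequence lengths for a 256x256 RGB image under the four-level
# partition: image -> 16x16 patches -> 4x4 sub-patches -> pixels -> channels.

region_sides = [256, 16, 4, 1]  # side length of each level's region
seq_lens = []
for parent, child in zip(region_sides, region_sides[1:]):
    seq_lens.append((parent // child) ** 2)  # tokens per module at this level
seq_lens.append(3)                           # final level: 3 RGB channels

print(seq_lens)  # [256, 16, 16, 3]

# Multiplying the spatial branching factors recovers the full pixel count.
total_pixels = 1
for s in seq_lens[:-1]:
    total_pixels *= s
print(total_pixels)  # 65536 = 256 * 256
```

Note that no individual module ever sees a sequence longer than 256 tokens, even though the leaves collectively cover all 65,536 pixels.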

In this image generation instantiation, each autoregressive module at a given fractal level receives output from the previous level's generator (providing context) and image patches corresponding to its scope. It processes these inputs using Transformer blocks and produces outputs for the next level's generators. The sequence length within each individual Transformer remains small and manageable (e.g., 256 for the first level, 16 for lower levels), contrasting sharply with the prohibitively long sequences required for a single-level autoregressive model of raw pixels. This dramatically reduces the computational cost, particularly the attention mechanism's quadratic complexity, enabling pixel-by-pixel generation for high-resolution images. For example, modeling a 256x256 image with a 4-level fractal structure involves attention over sequences of length 256 or 16, instead of the full pixel count (≈65k) or final patch count (≈4k). The paper highlights that modeling a 256x256 image with their fractal design is only twice as computationally expensive as modeling a 64x64 image.
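A back-of-the-envelope estimate makes the savings concrete. The sketch below counts attention cost as the sum of (sequence length)^2 over all module invocations, ignoring per-level constants and hidden dimensions; `fractal_attention_cost` is a hypothetical helper, not a function from the paper.

```python
# Rough attention-cost comparison: fractal hierarchy vs. one flat
# autoregressive model over all pixels. Cost model: sum of L^2 over every
# module invocation (constants and widths ignored; purely illustrative).

def fractal_attention_cost(seq_lens):
    """Sum of L^2 over every module invocation across the hierarchy."""
    cost, num_modules = 0, 1
    for L in seq_lens:
        cost += num_modules * L * L
        num_modules *= L  # each token at this level spawns one child module
    return cost

fractal = fractal_attention_cost([256, 16, 16, 3])
flat = (256 * 256) ** 2  # one attention over all 65,536 pixels
print(f"fractal: {fractal:,}  flat: {flat:,}")
```

Under this crude model the fractal hierarchy is three orders of magnitude cheaper than flat pixel-level attention, which is the intuition behind the paper's efficiency claims (the paper's actual cost accounting also includes network width and layer counts).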

The paper contrasts its approach with related work:

  • Hierarchical Representations: Prior work such as SPPNet or FPN uses hierarchical structures, but typically lacks the recursive self-similarity and divide-and-conquer generation process central to the fractal model.
  • Hierarchical Generative Models: Two-stage models (like VQ-VAE + Transformer/Diffusion) rely on a tokenizer, which can introduce reconstruction errors. Cascaded diffusion models generate images scale-by-scale but often don't model raw pixels directly and typically lack the recursive sub-module structure.
  • Scale-space Autoregressive Models: These predict tokens scale-by-scale with a single large model, often incurring high attention costs for the entire sequence at each scale, unlike the fractal approach which distributes computation across smaller sequences within sub-modules.
  • Long-Sequence Modeling: Traditional pixel AR models (like PixelRNN, PixelCNN, Perceiver AR, MegaByte) treat the image as a 1D sequence, which is often unnatural and computationally expensive for high resolutions. The fractal approach treats the data as a set structure and recursively models subsets.
  • Modular Architectures: Many architectures are modular (e.g., ResNet and Transformer blocks), and FractalNet applied recursion to smaller network blocks, but for classification rather than for composing entire generative models for high-dimensional synthesis with exponential output growth.

The implementation trains the entire fractal model end-to-end using a breadth-first traversal during the forward pass (for loss computation) and generates samples using a depth-first traversal. Image patches and outputs from previous levels are used as inputs to subsequent level modules. Techniques like adding a "guiding pixel" (average patch value) and incorporating outputs from surrounding patches help improve generation quality and boundary consistency. Both causal (AR) and masked (MAR) autoregressive variants were explored, with MAR showing stronger empirical performance due to its ability to predict multiple tokens in parallel within a module. Classifier-Free Guidance (CFG) and temperature scaling are used for conditional generation.
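The contrast between the two traversal orders can be sketched with stub modules. This is an illustrative toy, assuming a simple deterministic module interface; the real models are neural networks with learned parameters and the function names here are hypothetical.

```python
# Two traversal orders over the same fractal hierarchy: depth-first for
# sampling (one module path active at a time), breadth-first for training
# (all modules at one level processed together, suiting batched losses).

def depth_first_sample(module_fn, condition, level, num_levels, k):
    """Fully expand each child before moving to its sibling."""
    if level == num_levels:
        return [condition]
    leaves = []
    for child in module_fn(condition, level, k):
        leaves += depth_first_sample(module_fn, child, level + 1,
                                     num_levels, k)
    return leaves

def breadth_first_forward(module_fn, conditions, num_levels, k):
    """Process every module of one level before descending a level."""
    for level in range(num_levels):
        next_conditions = []
        for c in conditions:
            next_conditions += module_fn(c, level, k)
        conditions = next_conditions
    return conditions

def stub_module(condition, level, k):
    # Deterministic stand-in for a learned autoregressive module.
    return [condition * k + i for i in range(k)]

a = depth_first_sample(stub_module, 0, 0, 2, 3)
b = breadth_first_forward(stub_module, [0], 2, 3)
print(a == b, len(a))  # True 9 -- both traversals reach the same leaves
```

With deterministic stubs the two orders produce identical leaves; in the actual model the choice matters for memory and parallelism, since depth-first sampling keeps only one module path in memory while breadth-first training exposes all same-level modules for batched loss computation.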

Experiments on ImageNet 64x64 demonstrate that increasing the number of fractal levels improves likelihood estimation (NLL) and computational efficiency compared to shallower or non-fractal AR baselines. On ImageNet 256x256, the fractal MAR model achieves competitive generation quality (FID, IS, Precision, Recall) against state-of-the-art GANs and diffusion models, notably being the only method generating raw pixels directly among the compared strong baselines. While its FID and Recall are slightly lower than top diffusion/GAN models, its Inception Score and Precision are very high, indicating high fidelity. The paper also shows promising scaling trends, suggesting potential for further improvements with larger models. Conditional pixel-by-pixel prediction tasks like inpainting, outpainting, uncropping, and class-conditional editing are demonstrated, highlighting the model's ability to predict unknown pixels based on context and demonstrating a more interpretable, element-by-element generation process.

In conclusion, Fractal Generative Models propose a novel architectural paradigm inspired by fractals, enabling efficient and effective modeling of high-dimensional, non-sequential data by recursively composing generative modules. The pixel-by-pixel image generation instantiation showcases its capability to tackle challenging tasks where traditional sequential or non-recursive hierarchical models struggle computationally. The work opens avenues for future research in designing and applying fractal structures to various data domains beyond images. The authors acknowledge potential negative societal consequences similar to other generative models, such as the misuse for disinformation or bias amplification.

