
Hierarchical Autoregressive Image Models with Auxiliary Decoders (1903.04933v2)

Published 6 Mar 2019 in cs.CV, cs.LG, and stat.ML

Abstract: Autoregressive generative models of images tend to be biased towards capturing local structure, and as a result they often produce samples which are lacking in terms of large-scale coherence. To address this, we propose two methods to learn discrete representations of images which abstract away local detail. We show that autoregressive models conditioned on these representations can produce high-fidelity reconstructions of images, and that we can train autoregressive priors on these representations that produce samples with large-scale coherence. We can recursively apply the learning procedure, yielding a hierarchy of progressively more abstract image representations. We train hierarchical class-conditional autoregressive models on the ImageNet dataset and demonstrate that they are able to generate realistic images at resolutions of 128$\times$128 and 256$\times$256 pixels. We also perform a human evaluation study comparing our models with both adversarial and likelihood-based state-of-the-art generative models.

Citations (37)

Summary

  • The paper introduces methods to learn abstract, discrete image representations that capture long-range dependencies and improve global coherence.
  • It proposes feed-forward auxiliary decoders and masked self-prediction as two ways to shift the model's focus from local detail to large-scale structure.
  • Experiments on ImageNet demonstrate competitive image fidelity and coherence, as measured by IS, FID, and human evaluations.

Hierarchical Autoregressive Image Models with Auxiliary Decoders: A Detailed Analysis

The paper "Hierarchical Autoregressive Image Models with Auxiliary Decoders" addresses a significant limitation in autoregressive generative models of images: their propensity to focus on local structure at the expense of large-scale coherence. Two novel methodologies are introduced to generate discrete image representations that abstract away local details, thereby enabling hierarchical modeling and facilitating the generation of high-fidelity, coherent images. The research demonstrates the application of these hierarchical autoregressive models conditioned on these learned representations to produce state-of-the-art generative results, particularly when evaluated against adversarial and likelihood-based models.

Key Contributions

  1. Problem Identification: Previous autoregressive models like PixelCNN are inherently biased towards capturing local detail due to their architectural design and likelihood loss function. This tendency results in image samples lacking large-scale coherence, which is often critical for human-centered evaluations.
  2. Hierarchical Representation Learning: The paper introduces two strategies for learning image representations that abstract away local detail, shifting the model's focus toward larger-scale structure:

    a. Feed-forward Decoders: Utilize a conventional autoencoder architecture where reconstruction from latent representations is optimized via auxiliary decoders. However, this can lead the model to capture excessive detail, which is not ideal when the objective is large-scale coherence.

    b. Masked Self-Prediction (MSP): Encourages representations that rely on long-range dependencies by masking parts of the input during training, forcing the model to learn representations that capture global structure rather than local pixel detail. Representations learned with MSP can be stacked to construct a multi-level hierarchy of image features (a simplified MSP training step is sketched after this list).

  3. Hierarchical Autoregressive Priors: Autoregressive models are trained on these abstracted representations, leading to significant improvements in sample quality at resolutions of 128x128 and 256x256 pixels when trained on ImageNet.
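
To make the MSP idea concrete, the following is a minimal sketch rather than the paper's method: it assumes a small convolutional encoder, a VQ-VAE-style codebook, a feed-forward auxiliary decoder, a random block mask, and a mean-squared reconstruction loss restricted to the masked pixels (the paper uses different architectures and a likelihood-based objective). All class and function names are hypothetical stand-ins.

```python
# Minimal sketch of masked self-prediction (MSP) with an auxiliary decoder,
# under simplified assumptions; not the paper's actual architecture or loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Image -> continuous feature grid (downsampled 4x)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator.
    The usual codebook/commitment losses are omitted for brevity."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=1)
        q = self.codebook(idx).view(b, h, w, c).permute(0, 3, 1, 2)
        q = z + (q - z).detach()  # straight-through: gradients reach the encoder
        return q, idx.view(b, h, w)

class AuxDecoder(nn.Module):
    """Feed-forward auxiliary decoder: code grid -> pixels (upsampled 4x)."""
    def __init__(self, dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(dim, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(dim, 3, 4, stride=2, padding=1),
        )

    def forward(self, q):
        return self.net(q)

def msp_loss(encoder, quantizer, decoder, images, mask_frac=0.5):
    """One training step's loss: predict only the pixels hidden from the encoder."""
    b, _, h, w = images.shape
    # Coarse random block mask (1 = visible, 0 = hidden), upsampled to pixel size.
    mask = (torch.rand(b, 1, h // 8, w // 8, device=images.device) > mask_frac).float()
    mask = F.interpolate(mask, size=(h, w), mode="nearest")
    z = encoder(images * mask)          # the encoder never sees the hidden pixels
    q, _ = quantizer(z)
    recon = decoder(q)
    hidden = 1.0 - mask
    # Scoring only the hidden pixels forces the codes to carry non-local information.
    denom = hidden.sum().clamp(min=1.0) * images.shape[1]
    return ((recon - images) ** 2 * hidden).sum() / denom
```

In this simplified view, the plain feed-forward auxiliary decoder corresponds to dropping the mask and reconstructing every pixel, which, as noted above, tends to preserve more local detail than is desirable for large-scale coherence.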

Experimental Validation

  • Human Evaluations: The models were subjected to perceptual evaluation through human ratings. The hierarchical models produced images with substantial large-scale coherence, comparable to current state-of-the-art adversarial models, showcasing the effectiveness of the learned representations.
  • Metric Comparisons: The models were evaluated using standard benchmarks, namely Inception Score (IS) and Fréchet Inception Distance (FID), highlighting their competitive performance in generating high-fidelity images (a minimal FID computation is sketched after this list).
  • Sampling Efficiency: Despite the improvements in coherence and detail brought by the hierarchy, sampling time remains feasible, although incremental improvements to sampling techniques such as buffering would further enhance real-world applicability.
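
For reference, FID measures the Fréchet distance between Gaussian fits of Inception features of real and generated images. The sketch below assumes the feature vectors have already been extracted; the `fid` helper and its inputs are illustrative, not the paper's evaluation code.

```python
# Minimal sketch of the Fréchet Inception Distance (FID), assuming Inception
# features for real and generated images are given as (num_samples, dim) arrays.
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, fake_feats: np.ndarray) -> float:
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(cov_r @ cov_f, disp=False)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts from numerics
        covmean = covmean.real
    diff = mu_r - mu_f
    return float(diff @ diff + np.trace(cov_r + cov_f - 2.0 * covmean))
```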

Theoretical and Practical Implications

The ability to abstract away local details while retaining necessary high-level information has implications beyond image generation. It suggests potential enhancements in tasks requiring hierarchical understanding, such as video synthesis or scene understanding in autonomous systems.

Specifying a hierarchy of autoencoders allows different levels of abstraction to be assigned to different levels of the model, opening avenues for more structured and interpretable generative processes. By using auxiliary decoders to guide this abstraction, one can circumvent issues such as posterior collapse, to which standard VAEs are susceptible.
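
Generation with such a hierarchy proceeds top-down: codes are sampled from the most abstract prior, and each level's conditional autoregressive model fills in the level below, down to pixels. The sketch below illustrates raster-scan ancestral sampling of a class-conditional code grid; `prior_logits` is a hypothetical stand-in for a trained PixelCNN-style prior, not the paper's implementation.

```python
# Minimal sketch of sampling a grid of discrete codes from a class-conditional
# autoregressive prior. `prior_logits(codes, class_ids)` is assumed to return
# next-code logits of shape (B, num_codes, H, W) for a partially filled grid.
import torch

@torch.no_grad()
def sample_codes(prior_logits, class_id, shape=(1, 32, 32), temperature=1.0):
    b, h, w = shape
    codes = torch.zeros(b, h, w, dtype=torch.long)
    class_ids = torch.full((b,), class_id, dtype=torch.long)
    for i in range(h):                                  # raster-scan order,
        for j in range(w):                              # one code at a time
            logits = prior_logits(codes, class_ids)[:, :, i, j] / temperature
            probs = logits.softmax(dim=-1)
            codes[:, i, j] = torch.multinomial(probs, 1).squeeze(-1)
    return codes

# The sampled code grid would then condition the next autoregressive model in
# the hierarchy, and so on, until the lowest level emits pixels.
```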

Future Prospects

The proposed methodologies hold promise for generative tasks across modalities, including audio and text. Future work may examine the relative advantages of feed-forward versus MSP auxiliary decoders in more nuanced settings, or leverage the hierarchical approach to drive innovations in unsupervised feature learning and domain adaptation.

This work paves the way for further refinements in likelihood-based generative modeling, potentially narrowing the gap with adversarial methods and offering robust alternatives where adversarial training is challenging or infeasible. The research represents a significant step towards models that balance local detail and global coherence, broadening the horizons for AI in creative and analytical domains.
