
Scalable Adaptive Computation for Iterative Generation (2212.11972v2)

Published 22 Dec 2022 in cs.LG, cs.CV, and cs.NE

Abstract: Natural data is redundant yet predominant architectures tile computation uniformly across their input and output space. We propose the Recurrent Interface Networks (RINs), an attention-based architecture that decouples its core computation from the dimensionality of the data, enabling adaptive computation for more scalable generation of high-dimensional data. RINs focus the bulk of computation (i.e. global self-attention) on a set of latent tokens, using cross-attention to read and write (i.e. route) information between latent and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and top-down (latent to data) feedback, leading to deeper and more expressive routing. While this routing introduces challenges, this is less problematic in recurrent computation settings where the task (and routing problem) changes gradually, such as iterative generation with diffusion models. We show how to leverage recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process with those from prior computation, i.e. latent self-conditioning. RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024X1024 images without cascades or guidance, while being domain-agnostic and up to 10X more efficient than 2D and 3D U-Nets.



Summary

  • The paper introduces RINs, an attention-based architecture that decouples core computation from the dimensionality of the data, enabling adaptive computation and more scalable generation of high-dimensional data.
  • It employs two sets of tokens: interface tokens tied directly to the data and latent tokens that carry the bulk of computation via global self-attention, with cross-attention routing information between the two.
  • Experiments demonstrate up to tenfold efficiency gains and superior FID and Inception Scores compared to U-Net-based diffusion models in image and video generation.

Scalable Adaptive Computation for Iterative Generation: A Review

The paper presents an architecture called Recurrent Interface Networks (RINs), designed to optimize the computation of generative models for high-dimensional data such as images and videos. RINs differentiate themselves from conventional models by decoupling core computation from data dimensionality, which facilitates adaptive computation and improved scalability. This approach mitigates the inefficiencies seen in prevalent architectures that uniformly allocate computation across input and output spaces.

Summary of Methodology

RINs leverage attention mechanisms for processing information differentially based on task requirements. The architecture employs two categories of tokens: interface tokens, which directly relate to the input data, and latent tokens, which undergo the majority of computational processing. The bulk of computation, specifically global self-attention, occurs on the latent tokens, while cross-attention dynamically routes information between interface and latent tokens. The separation and selective focus on latent tokens allow RINs to efficiently handle large-scale data sets, making them notably more efficient than 2D and 3D U-Nets used in state-of-the-art diffusion models.
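To make this read-compute-write pattern concrete, below is a minimal PyTorch sketch of a single RIN block, assuming standard multi-head attention with pre-normalization; the class, parameter names, and layer arrangement are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a single RIN block (illustrative, not the authors' code).
# Interface tokens x: [B, N, dim_x]; latent tokens z: [B, M, dim_z], with M << N.
import torch
import torch.nn as nn


class RINBlock(nn.Module):
    def __init__(self, dim_x, dim_z, heads=8):
        super().__init__()
        # "Read": latents cross-attend to interface (data) tokens.
        self.read = nn.MultiheadAttention(dim_z, heads, kdim=dim_x, vdim=dim_x,
                                          batch_first=True)
        # "Compute": global self-attention over the small latent set,
        # where the bulk of computation lives.
        self.compute = nn.MultiheadAttention(dim_z, heads, batch_first=True)
        self.mlp_z = nn.Sequential(nn.Linear(dim_z, 4 * dim_z), nn.GELU(),
                                   nn.Linear(4 * dim_z, dim_z))
        # "Write": interface tokens cross-attend to latents to receive updates.
        self.write = nn.MultiheadAttention(dim_x, heads, kdim=dim_z, vdim=dim_z,
                                           batch_first=True)
        self.norm_read = nn.LayerNorm(dim_z)
        self.norm_compute = nn.LayerNorm(dim_z)
        self.norm_mlp = nn.LayerNorm(dim_z)
        self.norm_write = nn.LayerNorm(dim_x)

    def forward(self, x, z):
        # Read: route information from data tokens into the latents.
        z = z + self.read(self.norm_read(z), x, x, need_weights=False)[0]
        # Compute: process the latents with self-attention and an MLP.
        h = self.norm_compute(z)
        z = z + self.compute(h, h, h, need_weights=False)[0]
        z = z + self.mlp_z(self.norm_mlp(z))
        # Write: route the processed information back to the data tokens.
        x = x + self.write(self.norm_write(x), z, z, need_weights=False)[0]
        return x, z
```

Stacking several such blocks yields the bottom-up and top-down feedback described next, since each block re-reads the interface after the previous block has written back to it.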

A notable aspect of the architecture is its ability to integrate bottom-up and top-down feedback through stacked RIN blocks. These feedback paths enhance routing expressiveness and computational depth, but deeper routing is also harder to learn when it must be solved from scratch at every forward pass. To address this, the paper introduces latent self-conditioning, where the latent tokens at each step of the reverse diffusion process are initialized from those computed at the previous step. This effectively forms a deeper, recurrent network without enlarging the latent set, leading to significant efficiency gains.
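As a rough illustration of how latent self-conditioning could be wired into training, the sketch below warm-starts the latents from a detached first pass with some probability; the function name, the `rin` callable's signature, and the 0.5 rate are assumptions for this sketch rather than the paper's exact recipe.

```python
# Rough illustration of latent self-conditioning at training time
# (names, signature, and the 0.5 rate are assumptions for this sketch).
import torch


def denoise_with_self_conditioning(rin, x_t, t, z_init, p_self_cond=0.5):
    """rin is assumed to map (noisy input, timestep, initial latents)
    to (prediction, final latents)."""
    if torch.rand(()) < p_self_cond:
        # First pass from a cold start, without gradients, to estimate latents.
        with torch.no_grad():
            _, z_prev = rin(x_t, t, z_init)
        # Warm-start the trained pass with the detached latents,
        # i.e. condition this step on prior computation.
        z_start = z_init + z_prev.detach()
    else:
        z_start = z_init
    # Second (trained) pass uses the warm-started latents.
    prediction, z_out = rin(x_t, t, z_start)
    return prediction, z_out
```

At sampling time, the latents produced at one denoising step would simply be carried forward as the initialization for the next step.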

Experimental Results

The architecture is evaluated with diffusion models on image and video generation tasks. RINs achieve state-of-the-art performance among pixel-space diffusion models, scaling to 1024×1024 images without cascades or guidance, while being up to ten times more efficient than 2D and 3D U-Net baselines. The results also underscore the architecture's domain-agnostic nature, providing versatility across generative tasks while allocating computation adaptively.

The experiments indicate that RINs deliver better FID and Inception Scores compared to traditional and convolution-based models, supporting their effective computation allocation and scalable design. Interestingly, despite having fewer inductive biases than convolutional architectures, RINs manage to maintain competitive performance even on smaller dataset tasks such as CIFAR-10, highlighting their adaptability and robustness.

Implications and Future Directions

The proposed architecture extends the understanding of adaptive computation in high-dimensional generative modeling. Practically, it suggests alternative strategies for allocating computation, potentially reducing the reliance on computationally intensive techniques such as cascades or guidance that pixel-space diffusion models typically require.

Latent self-conditioning also opens new opportunities for optimization, suggesting avenues for future research into adaptive routing within recurrent computation. The mechanism adds effective depth and expressiveness without compromising computational efficiency, making it a valuable direction for further exploration.

RINs offer a notable contribution to generative modeling, providing a scalable architecture with flexible computation allocation. As generative AI advances, there is scope for combining RINs with latent diffusion and adaptive guidance techniques to further improve performance and efficiency. The findings argue for reconsidering architectures that tile computation uniformly across the input in favor of models that adapt computation to the structure of the data.
