Scalable Adaptive Computation for Iterative Generation (2212.11972v2)
Abstract: Natural data is redundant, yet predominant architectures tile computation uniformly across their input and output space. We propose Recurrent Interface Networks (RINs), an attention-based architecture that decouples its core computation from the dimensionality of the data, enabling adaptive computation for more scalable generation of high-dimensional data. RINs focus the bulk of computation (i.e. global self-attention) on a set of latent tokens, using cross-attention to read and write (i.e. route) information between latent and data tokens. Stacking RIN blocks allows bottom-up (data to latent) and top-down (latent to data) feedback, leading to deeper and more expressive routing. While this routing introduces challenges, these are less problematic in recurrent computation settings where the task (and routing problem) changes gradually, such as iterative generation with diffusion models. We show how to leverage recurrence by conditioning the latent tokens at each forward pass of the reverse diffusion process on those from prior computation, i.e. latent self-conditioning. RINs yield state-of-the-art pixel diffusion models for image and video generation, scaling to 1024×1024 images without cascades or guidance, while being domain-agnostic and up to 10× more efficient than 2D and 3D U-Nets.
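The read/compute/write routing and latent self-conditioning described above can be sketched compactly. Below is a minimal PyTorch sketch under stated assumptions, not the authors' implementation (the paper's code is TensorFlow-based): all module names, dimensions, residual placements, and the way previous-step latents are injected are illustrative.

```python
import torch
import torch.nn as nn

class RINBlock(nn.Module):
    """One RIN block: read (data -> latents), compute (self-attention over the
    small latent set), write (latents -> data). Sizes are illustrative."""
    def __init__(self, data_dim=256, latent_dim=768, heads=8):
        super().__init__()
        # Read: latents cross-attend to data tokens (bottom-up routing).
        self.read = nn.MultiheadAttention(latent_dim, heads, kdim=data_dim,
                                          vdim=data_dim, batch_first=True)
        # Compute: the bulk of computation, independent of data dimensionality.
        self.compute = nn.TransformerEncoderLayer(latent_dim, heads,
                                                  batch_first=True)
        # Write: data tokens cross-attend to latents (top-down routing).
        self.write = nn.MultiheadAttention(data_dim, heads, kdim=latent_dim,
                                           vdim=latent_dim, batch_first=True)

    def forward(self, x, z):
        # x: (B, N_data, data_dim) data tokens; z: (B, N_latent, latent_dim).
        z = z + self.read(z, x, x, need_weights=False)[0]    # data -> latents
        z = self.compute(z)                                  # latent-only work
        x = x + self.write(x, z, z, need_weights=False)[0]   # latents -> data
        return x, z

class RIN(nn.Module):
    """Stacked RIN blocks with latent self-conditioning: latents from the
    previous denoising step warm-start the routing (stop-gradient assumed)."""
    def __init__(self, n_blocks=4, n_latents=128, data_dim=256, latent_dim=768):
        super().__init__()
        self.z_init = nn.Parameter(torch.zeros(n_latents, latent_dim))
        self.blocks = nn.ModuleList(RINBlock(data_dim, latent_dim)
                                    for _ in range(n_blocks))

    def forward(self, x, z_prev=None):
        z = self.z_init.expand(x.shape[0], -1, -1)
        if z_prev is not None:             # latent self-conditioning
            z = z + z_prev.detach()
        for block in self.blocks:
            x, z = block(x, z)
        return x, z                        # z is reused at the next step
```

In this sketch, each reverse-diffusion step passes its output latents `z` into the next step's call, so the routing problem is solved incrementally rather than from scratch; because self-attention runs only over `n_latents` tokens, compute grows with the latent set rather than with the number of data tokens.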