Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion (2405.16098v1)
Abstract: The Transformer architecture has dominated machine learning across a wide range of tasks. Its defining component is a computationally expensive scaled dot-product attention mechanism that models inter-token interactions and is widely credited for the architecture's success. However, this mechanism has no direct parallel in the human brain, which raises the question of whether scaled dot-product attention is necessary for strong expressive power. Inspired by the lateralization of the human brain, we propose a simple but effective architecture called the Lateralization MLP (L-MLP). Complex architectures can be built by stacking L-MLP blocks. Each L-MLP block is a multi-layer perceptron (MLP) design that permutes data dimensions, processes each dimension in parallel, merges the results, and finally passes them through a joint MLP. We find that this specific design outperforms other MLP variants and performs comparably to a transformer-based architecture on the challenging diffusion task while being highly efficient. We conduct experiments on text-to-image generation to demonstrate the effectiveness and efficiency of L-MLP. Further, we analyze the model's behavior and find a connection to the functional organization of the human brain. Our code is publicly available: \url{https://github.com/zizhao-hu/L-MLP}
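The block structure described in the abstract (permute dimensions, process each dimension in parallel, merge, then apply a joint MLP) can be made concrete with a short sketch. The PyTorch code below is a minimal illustrative reconstruction, not the authors' reference implementation (see the linked repository for that): the class name `LMLPBlock`, the sum-plus-residual merge, the expansion factor, and the normalization placement are all assumptions.

```python
# Minimal sketch of an L-MLP block, assuming input of shape [batch, tokens, channels].
# Hidden sizes, merge rule, and norm placement are assumptions, not the paper's exact design.
import torch
import torch.nn as nn


class LMLPBlock(nn.Module):  # hypothetical name for illustration
    def __init__(self, num_tokens: int, channels: int, expansion: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        # Branch 1: channel-mixing MLP, applied to each token independently.
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, expansion * channels),
            nn.GELU(),
            nn.Linear(expansion * channels, channels),
        )
        # Branch 2: token-mixing MLP, applied along the permuted token axis.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, expansion * num_tokens),
            nn.GELU(),
            nn.Linear(expansion * num_tokens, num_tokens),
        )
        # Joint MLP applied after the parallel branches are merged.
        self.joint_mlp = nn.Sequential(
            nn.Linear(channels, expansion * channels),
            nn.GELU(),
            nn.Linear(expansion * channels, channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, tokens, channels]
        h = self.norm(x)
        c = self.channel_mlp(h)                # mix along the channel dimension
        t = self.token_mlp(h.transpose(1, 2))  # permute, mix along the token dimension
        t = t.transpose(1, 2)                  # permute back to [batch, tokens, channels]
        merged = x + c + t                     # merge branches (assumed: sum + residual)
        return merged + self.joint_mlp(self.norm(merged))


# Usage: a stack of such blocks would form the full architecture.
block = LMLPBlock(num_tokens=64, channels=128)
x = torch.randn(2, 64, 128)
y = block(x)  # shape: [2, 64, 128]
```

The key design choice this sketch tries to capture is that each data dimension gets its own parallel MLP branch (loosely analogous to the two lateralized hemispheres), with a joint MLP integrating the merged result.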
Authors:
- Zizhao Hu
- Mohammad Rostami