
Lateralization MLP: A Simple Brain-inspired Architecture for Diffusion (2405.16098v1)

Published 25 May 2024 in cs.CV

Abstract: The Transformer architecture has dominated machine learning across a wide range of tasks. Its defining characteristic is an expensive scaled dot-product attention mechanism that models inter-token interactions and is widely regarded as the reason behind its success. However, this mechanism has no direct parallel in the human brain, which raises the question of whether scaled dot-product attention is necessary for intelligence with strong expressive power. Inspired by the lateralization of the human brain, we propose a new, simple but effective architecture called the Lateralization MLP (L-MLP). Stacking L-MLP blocks can produce complex architectures. Each L-MLP block is based on a multi-layer perceptron (MLP) that permutes data dimensions, processes each dimension in parallel, merges them, and finally passes the result through a joint MLP. We find that this specific design outperforms other MLP variants and performs comparably to a transformer-based architecture on the challenging diffusion task while being highly efficient. We conduct experiments on text-to-image generation tasks to demonstrate the effectiveness and efficiency of L-MLP. Further, we examine the model's behavior and discover a connection to the function of the human brain. Our code is publicly available: https://github.com/zizhao-hu/L-MLP
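The block structure described in the abstract (permute data dimensions, process each dimension in parallel, merge, then apply a joint MLP) can be sketched roughly as below. This is a minimal illustrative sketch in PyTorch, assuming a token/channel mixing layout with LayerNorm, GELU activations, and a residual connection; the module names, hidden sizes, and merge strategy are assumptions made for illustration, not the authors' exact design (see the linked repository for the actual implementation).

import torch
import torch.nn as nn

class LMLPBlock(nn.Module):
    """Illustrative sketch of an L-MLP block: two MLPs run in parallel,
    one over the channel dimension and one over the (permuted) token
    dimension; their outputs are merged and fed through a joint MLP.
    Normalization, hidden sizes, and the residual are assumptions."""

    def __init__(self, num_tokens: int, dim: int, hidden_mult: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Branch 1: MLP over the channel dimension (applied per token).
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )
        # Branch 2: MLP over the token dimension (applied per channel),
        # reached by permuting the last two axes.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden_mult * num_tokens),
            nn.GELU(),
            nn.Linear(hidden_mult * num_tokens, num_tokens),
        )
        # Joint MLP applied after merging the two parallel branches.
        self.joint_mlp = nn.Sequential(
            nn.Linear(dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        h = self.norm(x)
        # Mix channels within each token.
        ch = self.channel_mlp(h)
        # Permute so tokens form the last axis, mix tokens within each
        # channel, then permute back.
        tk = self.token_mlp(h.transpose(1, 2)).transpose(1, 2)
        # Merge the parallel branches (summation is an assumption here),
        # apply the joint MLP, and add a residual connection.
        return x + self.joint_mlp(ch + tk)

# Usage: stacking blocks, as the abstract suggests, yields a deeper model.
blocks = nn.Sequential(*[LMLPBlock(num_tokens=64, dim=256) for _ in range(4)])
out = blocks(torch.randn(2, 64, 256))  # -> shape (2, 64, 256)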

Authors (2)
  1. Zizhao Hu (10 papers)
  2. Mohammad Rostami (64 papers)