A Survey of Deep Learning: From Activations to Transformers

Published 1 Feb 2023 in cs.LG and cs.AI (arXiv:2302.00722v3)

Abstract: Deep learning has made tremendous progress in the last decade. A key success factor is the large amount of architectures, layers, objectives, and optimization techniques. They include a myriad of variants related to attention, normalization, skip connections, transformers and self-supervised learning schemes -- to name a few. We provide a comprehensive overview of the most important, recent works in these areas to those who already have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential, recent works helps researchers to form new connections between diverse areas of deep learning. We identify and discuss multiple patterns that summarize the key strategies for many of the successful innovations over the last decade as well as works that can be seen as rising stars. We also include a discussion on recent commercially built, closed-source models such as OpenAI's GPT-4 and Google's PaLM 2.

Summary

  • The paper provides an integrated review of transformer models and activation functions, elucidating their evolution and cross-domain applications.
  • The paper details state-of-the-art methodologies including novel loss functions, optimization techniques, and self-supervised learning strategies that enhance model performance.
  • The paper explores architectural innovations, such as multi-head attention and skip connections, which drive advancements in both natural language processing and computer vision.

Introduction

The landscape of deep learning (DL) has been transformed over the past decade, driven by innovations in architectures, training methodologies, and core components such as activation functions and attention mechanisms. This survey offers a comprehensive overview of these advances for researchers already familiar with the foundations of DL, integrating recent influential works into a cohesive narrative. The discussion spans the breadth of modern DL, highlighting emerging patterns and potential future research trajectories.

Overview of Deep Learning

The evolution of DL is characterized by the iterative enhancement of components such as objectives and optimization techniques, along with architecture-specific innovations. The rapid growth in applications, from computer vision to NLP, underscores the role of shared design concepts across disciplines. Notably, techniques initially developed for one domain often find utility in others, exemplified by the migration of innovations like CNNs and Transformers across different problem spaces (Figure 1).

Figure 1: Categorization of deep learning and areas covered in the survey.

Loss Functions and Optimization Techniques

Loss functions and optimization strategies form the backbone of DL advancements. The paper elaborates on specific loss functions such as Triplet Loss, Focal Loss, and Cycle Consistency Loss, each playing a pivotal role in enhancing model performance by focusing learning on task-specific requirements. Optimization advancements, like Adafactor and LAMB, aim to reduce computational costs while maintaining efficacy, illustrating the necessity for efficient resource utilization in training large models.
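As an illustration of how such task-specific losses reshape training, the sketch below implements a binary focal loss in the spirit of Lin et al.: easy, well-classified examples are down-weighted so learning concentrates on hard ones. The gamma and alpha defaults, the helper name, and the random example inputs are illustrative assumptions rather than values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: down-weights easy examples so training
    focuses on hard ones. gamma/alpha are illustrative defaults."""
    # Per-example binary cross-entropy, unreduced so it can be reweighted.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the predicted probability of the true class (bce = -log p_t).
    p_t = torch.exp(-bce)
    # Class-balancing weight: alpha for positives, (1 - alpha) for negatives.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # The (1 - p_t)^gamma factor shrinks the loss of confident predictions.
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# Example call with random logits/labels, just to show the signature.
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```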

Self, Semi-supervised, and Contrastive Learning

The survey highlights the paradigm shift towards leveraging unlabeled data via self-supervised, semi-supervised, and contrastive learning techniques. Methods like SimCLR and BYOL exemplify how such approaches reduce labeling costs while achieving performance close to that of supervised learning. These techniques often form the pre-training phase of large models, setting the stage for effective fine-tuning on smaller labeled datasets.
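To make the contrastive idea concrete, here is a minimal sketch of an NT-Xent-style objective in the spirit of SimCLR, assuming two augmented views of each image have already been encoded and projected. The temperature, tensor shapes, and random inputs are illustrative assumptions; a real pipeline adds augmentations, an encoder, and a projection head.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss sketch (in the spirit of SimCLR).
    z1, z2: embeddings of two augmented views of the same N images, shape (N, d)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    # Mask self-similarity so an example cannot be its own positive.
    sim.fill_diagonal_(float("-inf"))
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example usage with random "projections"; real code would use an encoder.
z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
print(nt_xent_loss(z1, z2))
```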

Architectures and Layers

Key architectural innovations discussed include advances in attention mechanisms and novel layer designs such as normalization layers and skip connections. Attention mechanisms, notably Scaled Dot-Product Multi-Head Attention (Figure 2), underpin current state-of-the-art architectures due to their ability to dynamically focus on relevant input segments. Skip connections, in turn, improve gradient flow and make it practical to train much deeper networks.

Figure 2: Transformer with the four basic blocks on top and the encoder and decoder at the bottom.
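The attention operation at the core of Figure 2 can be written in a few lines. Below is a minimal sketch of scaled dot-product attention; multi-head attention applies it in parallel over several learned subspaces. The mask convention (True where attention is allowed) and the tensor shapes are assumptions made for illustration.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    q, k, v: (..., seq_len, d_k); mask (optional): True where attention is allowed."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                                   # weighted sum of values

# Example: one "head" over a sequence of 5 tokens with d_k = 16.
q = k = v = torch.randn(1, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 5, 16])
```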

Transformer Architectures

Transformers have reshaped DL, particularly in NLP. Innovations from BERT to the GPT family showcase the potential of large-scale unsupervised pre-training followed by task-specific fine-tuning. These models move beyond traditional architectures by leveraging multi-head attention and layer normalization. The survey also discusses advances such as multi-modal processing and improved training efficiency in models such as ChatGPT and GPT-4, which are trained on vast amounts of data and support increasingly complex interactions.
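To show how multi-head attention, layer normalization, and skip connections combine in practice, here is a minimal pre-norm Transformer encoder block built on PyTorch's built-in attention module. The dimensions, the pre-norm placement, and the GELU feed-forward are illustrative choices and do not describe any particular model discussed above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal pre-norm Transformer encoder block: LayerNorm -> multi-head
    self-attention -> residual add, then LayerNorm -> feed-forward -> residual add."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention: queries = keys = values
        x = x + attn_out                   # skip connection around attention
        x = x + self.ff(self.norm2(x))     # skip connection around feed-forward
        return x

# Example: a batch of 2 sequences, 10 tokens each, embedding size 256.
x = torch.randn(2, 10, 256)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 256])
```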

Graph Neural Networks

Graph neural networks extend the principles of DL to structured, relational data through architectures such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). These networks exemplify architecture adaptation and highlight the overlap with non-traditional domains of DL, indicating further research opportunities in cross-domain architecture transfer.
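A minimal sketch of a single graph convolution layer, in the spirit of Kipf and Welling's GCN, shows how the familiar linear-transform-plus-nonlinearity recipe is adapted to graphs. The dense adjacency matrix and the tiny example graph are simplifications for illustration; practical implementations use sparse operations and batching.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer sketch (in the spirit of Kipf & Welling):
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W), with a dense adjacency for simplicity."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Add self-loops so each node also keeps its own features.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        # Symmetric normalization by node degree.
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        a_norm = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        # Aggregate neighbor features, then apply the learned transformation.
        return torch.relu(a_norm @ self.linear(x))

# Example: 4 nodes, 3 input features, a tiny undirected chain graph.
x = torch.randn(4, 3)
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(GCNLayer(3, 8)(x, adj).shape)  # torch.Size([4, 8])
```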

Discussion

The survey identifies essential patterns such as "Multi-X" (parallel component usage) and "Higher order layers" (more complex data transformations), illustrating recurring themes in successful DL innovations. These patterns suggest that future progress may rely not solely on entirely novel architectures but on the strategic combination and refinement of existing components. The paper also notes the role of self-supervised learning in scaling networks effectively.
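The "Multi-X" pattern can be illustrated with a generic wrapper that runs several copies of a component in parallel and concatenates their outputs, much as multi-head attention runs several heads side by side. The class and its parameters below are a hypothetical illustration of the pattern, not a construct from the paper.

```python
import torch
import torch.nn as nn

class MultiX(nn.Module):
    """Generic 'Multi-X' pattern sketch: run n parallel copies of a component
    on the same input and concatenate the results (cf. multi-head attention,
    grouped convolutions, multi-branch blocks)."""
    def __init__(self, make_component, n=4):
        super().__init__()
        self.branches = nn.ModuleList([make_component() for _ in range(n)])

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=-1)

# Example: four parallel linear "heads" of width 8 give a 32-dim output.
multi = MultiX(lambda: nn.Linear(16, 8), n=4)
print(multi(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```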

Conclusions

The survey concludes that while recent years have seen largely incremental advances, there remains space for radical innovation. The transformative impact of models like Transformers underlines the importance of strategic experimentation with DL components, setting the stage for future breakthroughs in AI. By integrating influential works and highlighting effective design patterns, the survey aims to help researchers devise new, potentially groundbreaking methodologies.
