A Survey of Deep Learning: From Activations to Transformers

Published 1 Feb 2023 in cs.LG and cs.AI (arXiv:2302.00722v3)

Abstract: Deep learning has made tremendous progress in the last decade. A key success factor is the large amount of architectures, layers, objectives, and optimization techniques. They include a myriad of variants related to attention, normalization, skip connections, transformers and self-supervised learning schemes -- to name a few. We provide a comprehensive overview of the most important, recent works in these areas to those who already have a basic understanding of deep learning. We hope that a holistic and unified treatment of influential, recent works helps researchers to form new connections between diverse areas of deep learning. We identify and discuss multiple patterns that summarize the key strategies for many of the successful innovations over the last decade as well as works that can be seen as rising stars. We also include a discussion on recent commercially built, closed-source models such as OpenAI's GPT-4 and Google's PaLM 2.

Summary

  • The paper provides an integrated review of transformer models and activation functions, elucidating their evolution and cross-domain applications.
  • The paper details state-of-the-art methodologies including novel loss functions, optimization techniques, and self-supervised learning strategies that enhance model performance.
  • The paper explores architectural innovations, such as multi-head attention and skip connections, which drive advancements in both natural language processing and computer vision.

Introduction

The landscape of deep learning (DL) has been transformed over the past decade, driven by innovations in architectures, training methodologies, and core components such as activation functions and attention mechanisms. This survey offers a comprehensive overview of these advances for researchers already familiar with the foundations of DL, integrating recent influential works into a cohesive narrative. The discussion spans the breadth of modern DL, highlighting emerging patterns and potential future research trajectories.

Overview of Deep Learning

The evolution of DL is characterized by the iterative enhancement of components such as objectives and optimization techniques, along with architecture-specific innovations. The rapid growth in applications, from computer vision to NLP, underscores the role of shared design concepts across disciplines. Notably, techniques initially developed for one domain often find utility in others, exemplified by the migration of innovations like CNNs and Transformers across different problem spaces (Figure 1).

Figure 1: Categorization of deep learning and areas covered in the survey.

Loss Functions and Optimization Techniques

Loss functions and optimization strategies form the backbone of DL advancements. The paper elaborates on specific loss functions such as Triplet Loss, Focal Loss, and Cycle Consistency Loss, each playing a pivotal role in enhancing model performance by focusing learning on task-specific requirements. Optimization advancements, like Adafactor and LAMB, aim to reduce computational costs while maintaining efficacy, illustrating the necessity for efficient resource utilization in training large models.
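As an illustration of how such task-specific losses reshape training, the sketch below implements a binary focal loss in the spirit of Lin et al.: easy, well-classified examples are down-weighted so learning concentrates on hard ones. The gamma and alpha defaults, the helper name, and the random example inputs are illustrative assumptions rather than values prescribed by the survey.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss sketch: down-weights easy examples so training
    focuses on hard ones. gamma/alpha are illustrative defaults."""
    # Per-example binary cross-entropy, unreduced so it can be reweighted.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # p_t is the predicted probability of the true class (bce = -log p_t).
    p_t = torch.exp(-bce)
    # Class-balancing weight: alpha for positives, (1 - alpha) for negatives.
    alpha_t = alpha * targets + (1.0 - alpha) * (1.0 - targets)
    # The (1 - p_t)^gamma factor shrinks the loss of confident predictions.
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()

# Example call with random logits/labels, just to show the signature.
logits = torch.randn(8)
targets = torch.randint(0, 2, (8,)).float()
print(focal_loss(logits, targets))
```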

Self, Semi-supervised, and Contrastive Learning

The survey highlights the paradigm shift towards leveraging unlabeled data via self-supervised, semi-supervised, and contrastive learning techniques. Methods like SimCLR and BYOL exemplify how such approaches reduce labeling costs while achieving performance close to that of supervised learning. These techniques often form the pre-training phase of large models, setting the stage for effective fine-tuning on smaller labeled datasets.
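To make the contrastive idea concrete, here is a minimal sketch of an NT-Xent-style objective in the spirit of SimCLR, assuming two augmented views of each image have already been encoded and projected. The temperature, tensor shapes, and random inputs are illustrative assumptions; a real pipeline adds augmentations, an encoder, and a projection head.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent-style contrastive loss sketch (in the spirit of SimCLR).
    z1, z2: embeddings of two augmented views of the same N images, shape (N, d)."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, d), unit norm
    sim = z @ z.t() / temperature                         # scaled cosine similarities
    # Mask self-similarity so an example cannot be its own positive.
    sim.fill_diagonal_(float("-inf"))
    # For row i, the positive is the other augmented view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Example usage with random "projections"; real code would use an encoder.
z1, z2 = torch.randn(4, 128), torch.randn(4, 128)
print(nt_xent_loss(z1, z2))
```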

Architectures and Layers

Key architectural innovations discussed include advances in attention mechanisms and novel layer designs such as normalization layers and skip connections. Attention mechanisms, notably Scaled Dot-Product Multi-Head Attention (Figure 2), underpin current state-of-the-art architectures due to their ability to dynamically focus on relevant input segments. Skip connections, in turn, improve gradient flow and make it practical to train much deeper networks.

Figure 2: Transformer with the four basic blocks on top and the encoder and decoder at the bottom.
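The attention operation at the core of Figure 2 can be written in a few lines. Below is a minimal sketch of scaled dot-product attention; multi-head attention applies it in parallel over several learned subspaces. The mask convention (True where attention is allowed) and the tensor shapes are assumptions made for illustration.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V.
    q, k, v: (..., seq_len, d_k); mask (optional): True where attention is allowed."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # (..., seq_q, seq_k)
    if mask is not None:
        scores = scores.masked_fill(~mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # attention distribution
    return weights @ v                                   # weighted sum of values

# Example: one "head" over a sequence of 5 tokens with d_k = 16.
q = k = v = torch.randn(1, 5, 16)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 5, 16])
```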

Transformer Architectures

Transformers have reshaped DL, particularly in NLP. Innovations from BERT to the GPT family showcase the potential of large-scale unsupervised pre-training followed by task-specific fine-tuning. These models move beyond traditional architectures by leveraging multi-head attention and layer normalization. The survey also discusses advances such as multi-modal processing and improved training efficiency in models such as ChatGPT and GPT-4, which are trained on vast amounts of data and support increasingly complex interactions.
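To show how multi-head attention, layer normalization, and skip connections combine in practice, here is a minimal pre-norm Transformer encoder block built on PyTorch's built-in attention module. The dimensions, the pre-norm placement, and the GELU feed-forward are illustrative choices and do not describe any particular model discussed above.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal pre-norm Transformer encoder block: LayerNorm -> multi-head
    self-attention -> residual add, then LayerNorm -> feed-forward -> residual add."""
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                          batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)   # self-attention: queries = keys = values
        x = x + attn_out                   # skip connection around attention
        x = x + self.ff(self.norm2(x))     # skip connection around feed-forward
        return x

# Example: a batch of 2 sequences, 10 tokens each, embedding size 256.
x = torch.randn(2, 10, 256)
print(EncoderBlock()(x).shape)  # torch.Size([2, 10, 256])
```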

Graph Neural Networks

Graph neural networks extend the principles of DL to structured, relational data through architectures such as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs). These networks exemplify architecture adaptation and highlight the overlap with non-traditional domains of DL, indicating further research opportunities in cross-domain architecture transfer.
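A minimal sketch of a single graph convolution layer, in the spirit of Kipf and Welling's GCN, shows how the familiar linear-transform-plus-nonlinearity recipe is adapted to graphs. The dense adjacency matrix and the tiny example graph are simplifications for illustration; practical implementations use sparse operations and batching.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph convolution layer sketch (in the spirit of Kipf & Welling):
    H' = relu(D^{-1/2} (A + I) D^{-1/2} H W), with a dense adjacency for simplicity."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Add self-loops so each node also keeps its own features.
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)
        # Symmetric normalization by node degree.
        deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
        a_norm = deg_inv_sqrt[:, None] * a_hat * deg_inv_sqrt[None, :]
        # Aggregate neighbor features, then apply the learned transformation.
        return torch.relu(a_norm @ self.linear(x))

# Example: 4 nodes, 3 input features, a tiny undirected chain graph.
x = torch.randn(4, 3)
adj = torch.tensor([[0., 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]])
print(GCNLayer(3, 8)(x, adj).shape)  # torch.Size([4, 8])
```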

Discussion

The survey identifies essential patterns such as "Multi-X" (parallel component usage) and "Higher order layers" (more complex data transformations), illustrating recurring themes in successful DL innovations. These patterns suggest that future progress may rely not solely on entirely novel architectures but on the strategic combination and refinement of existing components. The paper also notes the role of self-supervised learning in scaling networks effectively.
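The "Multi-X" pattern can be illustrated with a generic wrapper that runs several copies of a component in parallel and concatenates their outputs, much as multi-head attention runs several heads side by side. The class and its parameters below are a hypothetical illustration of the pattern, not a construct from the paper.

```python
import torch
import torch.nn as nn

class MultiX(nn.Module):
    """Generic 'Multi-X' pattern sketch: run n parallel copies of a component
    on the same input and concatenate the results (cf. multi-head attention,
    grouped convolutions, multi-branch blocks)."""
    def __init__(self, make_component, n=4):
        super().__init__()
        self.branches = nn.ModuleList([make_component() for _ in range(n)])

    def forward(self, x):
        return torch.cat([branch(x) for branch in self.branches], dim=-1)

# Example: four parallel linear "heads" of width 8 give a 32-dim output.
multi = MultiX(lambda: nn.Linear(16, 8), n=4)
print(multi(torch.randn(2, 16)).shape)  # torch.Size([2, 32])
```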

Conclusions

The survey concludes that while recent years have seen largely incremental advances, there remains space for radical innovation. The transformative impact of models like Transformers underlines the importance of strategic experimentation with DL components, setting the stage for future breakthroughs in AI. By integrating influential works and highlighting effective design patterns, the survey aims to help researchers devise new, potentially groundbreaking methodologies.
