Mish: A Self Regularized Non-Monotonic Activation Function (1908.08681v3)

Published 23 Aug 2019 in cs.LG, cs.CV, cs.NE, and stat.ML

Abstract: We propose $\textit{Mish}$, a novel self-regularized non-monotonic activation function which can be mathematically defined as: $f(x)=x\tanh(softplus(x))$. As activation functions play a crucial role in the performance and training dynamics in neural networks, we validated experimentally on several well-known benchmarks against the best combinations of architectures and activation functions. We also observe that data augmentation techniques have a favorable effect on benchmarks like ImageNet-1k and MS-COCO across multiple architectures. For example, Mish outperformed Leaky ReLU on YOLOv4 with a CSP-DarkNet-53 backbone on average precision ($AP_{50}^{val}$) by 2.1$\%$ in MS-COCO object detection and ReLU on ResNet-50 on ImageNet-1k in Top-1 accuracy by $\approx$1$\%$ while keeping all other network parameters and hyperparameters constant. Furthermore, we explore the mathematical formulation of Mish in relation with the Swish family of functions and propose an intuitive understanding on how the first derivative behavior may be acting as a regularizer helping the optimization of deep neural networks. Code is publicly available at https://github.com/digantamisra98/Mish.

Citations (658)

Summary

  • The paper presents Mish, a novel activation function that outperforms traditional functions like ReLU and Swish in accuracy.
  • The paper employs extensive experiments on datasets such as CIFAR-10 and ImageNet, achieving up to a 3% improvement in classification accuracy.
  • The paper highlights Mish's smooth, non-monotonic design as a natural optimization regularizer, reducing dependency on additional normalization layers.

Mish: A Self-Regularized Non-Monotonic Activation Function

The paper presents Mish, a novel self-regularized non-monotonic activation function designed to enhance neural network performance across several computer vision tasks. Mish is mathematically defined as $f(x) = x \tanh(\text{softplus}(x))$. The introduction of Mish aims to address limitations observed in traditional and more recent activation functions like ReLU and Swish.
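
The definition translates directly into a few lines of framework code. Below is a minimal sketch in PyTorch, not the authors' reference implementation (which is linked in the abstract); recent PyTorch releases also ship an equivalent built-in module, torch.nn.Mish.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    """Mish activation: f(x) = x * tanh(softplus(x))."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # softplus(x) = ln(1 + exp(x)); F.softplus is the numerically stable form
        return x * torch.tanh(F.softplus(x))

# Quick check on a few sample inputs: the function is smooth, non-monotonic,
# and preserves small negative values.
x = torch.tensor([-3.0, -1.0, 0.0, 1.0, 3.0])
print(Mish()(x))
```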

Overview and Motivation

Activation functions are integral to neural networks, introducing essential non-linearity. Historically, Sigmoid and TanH dominated, but as network depth increased, ReLU became prevalent due to its simplicity and improved convergence. However, ReLU's drawbacks, such as the "Dying ReLU" problem, necessitated exploration of alternatives like Leaky ReLU, ELU, SELU, and Swish.

Mish emerges from this lineage, inspired by Swish and its self-gating properties. It maintains a smooth and continuous profile, is non-monotonic, and preserves a range of negative values. The paper's experimental results demonstrate Mish's superior performance over ReLU and Swish in complex models, hypothesizing that the first derivative's behavior acts as an optimization regularizer.
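
For reference, the first derivative that this hypothesis concerns follows directly from the product and chain rules applied to the definition above, using $\frac{d}{dx}\,\text{softplus}(x) = \sigma(x)$ (the logistic sigmoid); one of several equivalent forms is

$f'(x) = \tanh(\text{softplus}(x)) + x\,\sigma(x)\,\text{sech}^2(\text{softplus}(x))$.

It is the behavior of this quantity that the authors hypothesize acts as an implicit regularizer during optimization.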

Experimental Results and Benchmarks

Extensive empirical evaluations were conducted across benchmark datasets including CIFAR-10, ImageNet-1k, and MS-COCO. In image classification, Mish consistently outperformed ReLU and Swish, with accuracy improvements of up to 3% on CIFAR-10 across architectures such as ResNet-20 and DenseNet-121.
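
The experimental protocol amounts to holding the architecture and all hyperparameters fixed and swapping only the activation function. The sketch below illustrates how such a comparison could be set up in PyTorch; swap_relu_for_mish is a hypothetical helper, torchvision's ResNet-18 stands in for the paper's ResNet-20 and DenseNet-121 variants, and the built-in nn.Mish (PyTorch ≥ 1.9) is assumed.

```python
import torch.nn as nn
from torchvision.models import resnet18  # stand-in for the paper's CIFAR architectures

def swap_relu_for_mish(module: nn.Module) -> nn.Module:
    """Recursively replace every nn.ReLU in a model with nn.Mish.

    Illustrative only: the point is to keep the architecture and all
    hyperparameters identical so that the activation is the sole variable.
    """
    for name, child in module.named_children():
        if isinstance(child, nn.ReLU):
            setattr(module, name, nn.Mish())
        else:
            swap_relu_for_mish(child)
    return module

# Baseline (ReLU) vs. candidate (Mish) models for a CIFAR-10-style comparison.
relu_model = resnet18(num_classes=10)
mish_model = swap_relu_for_mish(resnet18(num_classes=10))
```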

On ImageNet-1k, Mish demonstrated an approximately 1% improvement in Top-1 accuracy over ReLU on a ResNet-50 architecture, with all other network parameters and hyperparameters held constant. The consistency of these results was also validated in real-world applications such as object detection on MS-COCO, where Mish improved $AP_{50}^{val}$ by 2.1% over Leaky ReLU when integrated into YOLOv4 with a CSP-DarkNet-53 backbone.

Implications and Future Directions

The introduction of Mish has both practical and theoretical implications. Practically, it gives researchers and practitioners a robust activation function for improving deep learning model performance, especially in vision tasks. Theoretically, the paper opens avenues for further investigation into the regularizing effects of activation function design, particularly the role of the first derivative.

Future developments could explore optimizing Mish's computational efficiency, as Mish-CUDA has already demonstrated promising reductions in training overhead. Further research could also focus on deriving a normalizing parameter for Mish, potentially reducing reliance on batch normalization layers.
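
As an illustration of the efficiency question (not a reproduction of Mish-CUDA's benchmarks), one could compare the naive composition of primitives against the dedicated built-in operator, torch.nn.functional.mish, available in recent PyTorch releases; the timings are purely indicative and depend on hardware.

```python
import time
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4096, 4096, device=device)

def naive_mish(t: torch.Tensor) -> torch.Tensor:
    # Direct composition of the definition: x * tanh(softplus(x)).
    return t * torch.tanh(F.softplus(t))

for label, fn in {"naive composition": naive_mish, "built-in F.mish": F.mish}.items():
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(100):
        fn(x)
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"{label}: {time.perf_counter() - start:.3f}s for 100 forward passes")
```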

Conclusion

Mish represents a meaningful contribution to the development of activation functions, enhancing model expressivity and optimization stability while maintaining computational feasibility. Its consistent performance improvements across various architectures and datasets suggest it is a valuable addition to the toolkit of both researchers and industry practitioners. The underlying mechanisms of its regularization effects warrant further exploration, potentially informing the design of new activation functions with enhanced performance characteristics.
