A Method on Searching Better Activation Functions (2405.12954v2)

Published 19 May 2024 in cs.LG and cs.AI

Abstract: The success of artificial neural networks (ANNs) hinges greatly on the judicious selection of an activation function, which introduces non-linearity into the network and enables it to model sophisticated relationships in data. However, the search for activation functions has largely relied on empirical knowledge and lacked theoretical guidance, which has hindered the identification of more effective activation functions. In this work, we offer a solution to this issue. First, we theoretically demonstrate the existence of the worst activation function with boundary conditions (WAFBC) from the perspective of information entropy. Then, inspired by the Taylor expansion form of the information entropy functional, we propose the Entropy-based Activation Function Optimization (EAFO) methodology. EAFO provides a novel perspective on designing static activation functions for deep neural networks and on dynamically optimizing activations during iterative training. Using EAFO, we derive a novel activation function from ReLU, called Correction Regularized ReLU (CRReLU). Experiments with vision transformers and their variants on the CIFAR-10, CIFAR-100 and ImageNet-1K datasets demonstrate the superiority of CRReLU over existing corrections of ReLU. In extensive empirical studies on LLM fine-tuning, CRReLU also exhibits superior performance compared to GELU, suggesting its broader potential for practical applications.
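
The abstract does not reproduce the exact CRReLU formula, so the sketch below is only an illustration of the general idea: a ReLU-style activation augmented with a small, learnable correction term that can be dropped into a transformer block in place of GELU. The specific correction term and the parameter name `epsilon` are assumptions made for this sketch, not the paper's definition.

```python
# Illustrative sketch only: a "corrected" ReLU with a learnable correction scale.
# The correction term below is an assumption; see the paper for the actual CRReLU form.
import torch
import torch.nn as nn


class CorrectedReLU(nn.Module):
    """ReLU plus a small learnable correction term (illustrative, not the paper's exact CRReLU)."""

    def __init__(self, epsilon_init: float = 0.01):
        super().__init__()
        # Learnable scale for the correction; EAFO also suggests the possibility of
        # updating such a term dynamically during training.
        self.epsilon = nn.Parameter(torch.tensor(epsilon_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Standard ReLU branch plus a smooth, bounded perturbation around zero.
        return torch.relu(x) + self.epsilon * x * torch.exp(-x.pow(2) / 2)


if __name__ == "__main__":
    act = CorrectedReLU()
    x = torch.randn(4, 8)
    print(act(x).shape)  # torch.Size([4, 8])
```

Conceptually, the paper's comparisons amount to swapping such a module in for the GELU (or other ReLU correction) inside the MLP blocks of a vision transformer and training on CIFAR-10, CIFAR-100 or ImageNet-1K.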

