Mitigating Communication Costs in Neural Networks: The Role of Dendritic Nonlinearity (2306.11950v2)

Published 21 Jun 2023 in cs.NE, cs.LG, and q-bio.NC

Abstract: Our understanding of biological neuronal networks has profoundly influenced the development of artificial neural networks (ANNs). However, neurons utilized in ANNs differ considerably from their biological counterparts, primarily due to the absence of complex dendritic trees with local nonlinearities. Early studies have suggested that dendritic nonlinearities could substantially improve the learning capabilities of neural network models. In this study, we systematically examined the role of nonlinear dendrites within neural networks. Utilizing machine-learning methodologies, we assessed how dendritic nonlinearities influence neural network performance. Our findings demonstrate that dendritic nonlinearities do not substantially affect learning capacity; rather, their primary benefit lies in enabling network capacity expansion while minimizing communication costs through effective localized feature aggregation. This research provides critical insights with significant implications for designing future neural network accelerators aimed at reducing communication overhead during neural network training and inference.
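
To make the abstract's central mechanism concrete: in the common two-layer abstraction of a dendritic neuron, synaptic inputs are grouped onto branches, each branch applies a local nonlinearity, and the soma sums only the branch outputs. A neuron with B branches of k synapses each therefore forwards B aggregated values to the soma rather than B·k raw activations, which is where the communication savings arise. The sketch below is a minimal illustration of that abstraction under stated assumptions, not the paper's implementation; the PyTorch framing, the ReLU branch nonlinearity, and names such as `DendriticLayer` and `n_branches` are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class DendriticLayer(nn.Module):
    """Illustrative two-layer dendritic unit (not the paper's code).

    Inputs are split into branches; each branch applies a local linear
    map followed by a nonlinearity, and the soma sums the branch
    outputs. Per neuron, only n_branches values reach the soma instead
    of n_inputs raw activations, modeling localized feature aggregation.
    """

    def __init__(self, n_inputs: int, n_neurons: int, n_branches: int):
        super().__init__()
        assert n_inputs % n_branches == 0, "branches must tile the input"
        self.n_branches = n_branches
        self.branch_size = n_inputs // n_branches
        # One local weight vector per (neuron, branch); each branch sees
        # only its own contiguous slice of the input (an assumption --
        # real dendrites sample inputs non-contiguously).
        self.branch_weights = nn.Parameter(
            torch.randn(n_neurons, n_branches, self.branch_size) * 0.01
        )
        self.branch_bias = nn.Parameter(torch.zeros(n_neurons, n_branches))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_inputs) -> (batch, n_branches, branch_size)
        xb = x.view(x.shape[0], self.n_branches, self.branch_size)
        # Local dendritic integration: (batch, n_neurons, n_branches)
        local = torch.einsum("bks,nks->bnk", xb, self.branch_weights)
        local = torch.relu(local + self.branch_bias)  # branch nonlinearity
        return local.sum(dim=-1)  # somatic summation over branches


layer = DendriticLayer(n_inputs=128, n_neurons=64, n_branches=8)
out = layer(torch.randn(32, 128))  # -> shape (32, 64)
```

Replacing `torch.relu` with the identity collapses each unit to a single linear neuron, so the role of the branch nonlinearity is easy to ablate in this toy setting.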

Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. 
Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. 
Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. 
Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? 
Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 
28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. 
Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000) (9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. 
Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. 
Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. 
Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. 
Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? 
Nature Communications 8(1), 1116 (2017)
(24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969)
(25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021)
(26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020)
(27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014)
(28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4), 481–487 (2004)
(29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021)
(30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, development, and compartmentation of the granule cells of the cerebellum. Frontiers in Neural Circuits 14 (2021)
(31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
(32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25
(33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)
(34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
(35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
(36) Bondy, J.A., Murty, U.S.R.: Graph Theory. Graduate Texts in Mathematics, vol. 244. Springer (2008)
(37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013)
(38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
(39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
(40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021)
(41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Stuart, G., Spruston, N., Häusser, M.: Dendrites. Oxford University Press, Oxford (2016) (8) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000) (9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. 
Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000) (9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. 
Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. 
Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. 
Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Stuart, G., Spruston, N., Häusser, M.: Dendrites. Oxford University Press, Oxford (2016) (8) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000) (9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. 
PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Chklovskii, D.B.: Optimal sizes of dendritic and axonal arbors in a topographic projection. Journal of Neurophysiology 83(4), 2113–2119 (2000) (9) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Magee, J.C.: Dendritic integration of excitatory synaptic input. 
Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. 
Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. 
The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. 
Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Magee, J.C.: Dendritic integration of excitatory synaptic input. Nature Reviews Neuroscience 1(3), 181–190 (2000) (10) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. 
Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. 
Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? 
Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 
28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. 
Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. 
In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. 
Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? 
Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 
28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. 
Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. 
The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Nature Communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, development, and compartmentation of the granule cells of the cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (Graduate Texts in Mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. 
Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 
248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 
248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. 
Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  9. Schiller, J., Major, G., Koester, H.J., Schiller, Y.: Nmda spikes in basal dendrites of cortical pyramidal neurons. Nature 404(6775), 285–289 (2000) (11) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Polsky, A., Mel, B.W., Schiller, J.: Computational subunits in thin dendrites of pyramidal cells. Nature neuroscience 7(6), 621–627 (2004) (12) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. 
Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Major, G., Larkum, M.E., Schiller, J.: Active properties of neocortical pyramidal neuron dendrites. Annual review of neuroscience 36, 1–24 (2013) (13) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. 
PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. 
Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. 
Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 
248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  12. Poirazi, P., Mel, B.W.: Impact of Active Dendrites and Structural Plasticity on the Memory Capacity of Neural Tissue. Neuron 29(3), 779–796 (2001) (14) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. 
The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. 
  13. Jadi, M., Polsky, A., Schiller, J., Mel, B.W.: Location-dependent effects of inhibition on local spiking in pyramidal neuron dendrites. PLoS computational biology 8(6), 1002550 (2012) (15) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Wu, X., Liu, X., Li, W., Wu, Q.: Improved expressivity through dendritic neural networks. Advances in neural information processing systems 31 (2018) (16) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. 
Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. 
In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. 
Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. 
Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  15. Jones, I.S., Kording, K.P.: Might a single neuron solve interesting machine learning problems through successive computations on its dendritic tree? Neural Computation 33(6), 1554–1571 (2021) (17) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Richards, B.A., Lillicrap, T.P.: Dendritic solutions to the credit assignment problem. Current opinion in neurobiology 54, 28–36 (2019) (18) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Kastellakis, G., Poirazi, P.: Synaptic clustering and memory formation. Frontiers in molecular neuroscience 12, 300 (2019) (19) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. 
Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. 
Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. 
Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019)
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. 
Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. 
In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 
770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 
1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  18. Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural computation 3(2), 246–257 (1991) (20) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. 
In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Dally, W.: On the model of computation: point: We Must Extend Our Model of Computation to Account for Cost and Location. Communications of the ACM 65(9), 30–31 (2022) (21) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. 
Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021) (22) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003) (23) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature communications 8(1), 1116 (2017) (24) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. 
Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019)
  19. Dally, W.: On the model of computation: Point: We must extend our model of computation to account for cost and location. Communications of the ACM 65(9), 30–31 (2022)
  20. Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021)
  21. Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal neuron as two-layer neural network. Neuron 37(6), 989–999 (2003)
  22. Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature Communications 8(1), 1116 (2017)
  23. Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969)
  24. Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021)
  25. Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020)
  26. Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014)
  27. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4), 481–487 (2004)
  28. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: Where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021)
  29. Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, development, and compartmentation of the granule cells of the cerebellum. Frontiers in Neural Circuits 14 (2021)
  30. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  31. OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25
  32. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)
  33. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
  34. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
  35. Murty, U., Bondy, A.: Graph Theory. Graduate Texts in Mathematics 244. Springer (2008)
  36. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013)
  37. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  38. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: An ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
  39. Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021)
  40. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  20. Levy, W.B., Calvert, V.G.: Communication consumes 35 times more energy than computation in the human cortex, but both costs are needed to predict synapse number. Proceedings of the National Academy of Sciences 118(18), 2008173118 (2021)
  21. Poirazi, P., Brannon, T., Mel, B.W.: Pyramidal Neuron as Two-Layer Neural Network. Neuron 37(6), 989–999 (2003)
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 
1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. 
In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
22. Cayco-Gajic, N.A., Clopath, C., Silver, R.A.: Sparse synaptic connectivity is required for decorrelation and pattern separation in feedforward networks. Nature Communications 8(1), 1116 (2017)
23. Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969)
24. Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021)
25. Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020)
26. Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014)
27. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4), 481–487 (2004)
28. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021)
29. Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021)
30. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
31. OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25
32. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022)
33. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
34. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009)
35. Bondy, A., Murty, U.: Graph Theory (Graduate Texts in Mathematics 244). Springer (2008)
36. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013)
37. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
38. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: LibriSpeech: an ASR corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015)
39. Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021)
40. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  23. Marr, D.: A theory of cerebellar cortex. The Journal of Physiology 202(2), 437–470 (1969) (25) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. 
Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. 
arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. 
Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 
1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. 
In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  24. Luo, L.: Architectures of neuronal circuits. Science 373(6559), 7285 (2021) (26) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020) (27) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. 
Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014) (28) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). 
https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current opinion in neurobiology 14(4), 481–487 (2004) (29) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt*. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021) (30) Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, Development, and Compartmentation of the Granule Cells of the Cerebellum. Frontiers in Neural Circuits 14 (2021) (31) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020) (32) OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
  25. Sanger, T.D., Yamashita, O., Kawato, M.: Expansion coding and computation in the cerebellum: 50 years after the Marr–Albus codon theory. The Journal of Physiology 598(5), 913–928 (2020)
  26. Babadi, B., Sompolinsky, H.: Sparseness and expansion in sensory representations. Neuron 83(5), 1213–1226 (2014)
  27. Olshausen, B.A., Field, D.J.: Sparse coding of sensory inputs. Current Opinion in Neurobiology 14(4), 481–487 (2004)
  28. Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., Sutskever, I.: Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment 2021(12), 124003 (2021)
  29. Consalez, G.G., Goldowitz, D., Casoni, F., Hawkes, R.: Origins, development, and compartmentation of the granule cells of the cerebellum. Frontiers in Neural Circuits 14 (2021)
  30. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in Neural Information Processing Systems 33, 1877–1901 (2020)
  31. OpenAI: ChatGPT: Optimizing Language Models for Dialogue (2022). https://openai.com/blog/chatgpt/ Accessed 2023-05-25 (33) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. 
PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. 
arXiv preprint arXiv:1904.03288 (2019)
  32. Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., et al.: Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990 (2022) (34) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). 
Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  33. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016) (35) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 
5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  34. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255 (2009) (36) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. 
In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  35. Murty, U., Bondy, A.: Graph Theory (graduate texts in mathematics 244). Springer (2008) (37) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. 
arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  36. Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: Proceedings of the 30th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 28, pp. 1319–1327. PMLR, Atlanta, Georgia, USA (2013) (38) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  37. Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009) (39) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  38. Panayotov, V., Chen, G., Povey, D., Khudanpur, S.: Librispeech: an asr corpus based on public domain audio books. In: 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5206–5210 (2015) (40) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  39. Hassani, A., Walton, S., Shah, N., Abuduweili, A., Li, J., Shi, H.: Escaping the big data paradigm with compact transformers. arXiv preprint arXiv:2104.05704 (2021) (41) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019) Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)
  40. Li, J., Lavrukhin, V., Ginsburg, B., Leary, R., Kuchaiev, O., Cohen, J.M., Nguyen, H., Gadde, R.T.: Jasper: An end-to-end convolutional neural acoustic model. arXiv preprint arXiv:1904.03288 (2019)