Positional Encoding Helps Recurrent Neural Networks Handle a Large Vocabulary (2402.00236v5)
Abstract: This study reports the unintuitive finding that positional encoding enhances learning in recurrent neural networks (RNNs). Positional encoding is a high-dimensional representation of the time indices of input data. Most famously, positional encoding complements the capabilities of Transformer neural networks, which lack an inherent mechanism for representing the data order. By contrast, RNNs can encode the temporal information of data points on their own, making their use of positional encoding seemingly redundant. Nonetheless, investigations on synthetic benchmarks reveal an advantage of coupling positional encoding with RNNs, especially for handling a large vocabulary that yields low-frequency tokens. Further scrutiny reveals that these low-frequency tokens destabilize the gradients of vanilla RNNs, and that positional encoding resolves this instability. These results shed new light on the utility of positional encoding beyond its canonical role as a timekeeper for Transformers.
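The "high-dimensional representation of time indices" described above is, in its most common form, the sinusoidal positional encoding of Vaswani et al. (2017). The sketch below (a minimal NumPy illustration, not the paper's implementation; the variable names and the additive combination with embeddings are assumptions) shows how such an encoding is computed and combined with token embeddings before they enter an RNN:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal positional encoding.

    Row t is a d_model-dimensional representation of time index t:
        PE[t, 2i]   = sin(t / 10000**(2i / d_model))
        PE[t, 2i+1] = cos(t / 10000**(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]                       # (seq_len, 1)
    div_terms = 10000.0 ** (np.arange(0, d_model, 2) / d_model)   # (d_model/2,)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions / div_terms)
    pe[:, 1::2] = np.cos(positions / div_terms)
    return pe

# Hypothetical usage: add the encoding to token embeddings, then feed the
# position-aware inputs to an RNN step by step.
seq_len, d_model = 16, 8
embeddings = np.random.randn(seq_len, d_model)   # stand-in for learned token embeddings
rnn_inputs = embeddings + sinusoidal_positional_encoding(seq_len, d_model)
```

Because the encoding is a deterministic function of the time index alone, it supplies each RNN step with an explicit clock signal independent of token identity, which is the property the study links to gradient stabilization for low-frequency tokens.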