Advancing State of the Art in Language Modeling (2312.03735v1)
Abstract: Generalization is arguably the most important goal of statistical language modeling research. Publicly available benchmarks and papers published with open-source code have been critical to advancing the field. However, it is often very difficult, and sometimes even impossible, to fully reproduce the results reported in publications. In this paper, we propose a simple framework that should help advance the state of the art in language modeling in terms of generalization. We propose to publish not just the code, but also the per-token probabilities on dev and test sets with future publications, so that a new model can easily be added to an ensemble. This has crucial advantages: it becomes much easier to determine whether a newly proposed model is actually complementary to the current baseline. Therefore, instead of inventing new names for old tricks, the scientific community can advance faster. Finally, this approach promotes diversity of ideas: one does not need to create an individual model that is the new state of the art to attract attention; it is sufficient to develop a new model that learns patterns which other models do not. Thus, even a suboptimal model can be found to have value. Remarkably, our approach has yielded new state-of-the-art results across various language modeling benchmarks, with improvements of up to 10%.
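The proposed workflow is straightforward to apply: once each paper publishes per-token probabilities on shared dev and test sets, anyone can linearly interpolate a new model with an existing baseline and check whether perplexity actually drops. The sketch below illustrates this idea; it is not the authors' code, and the file names and format (one probability per token, whitespace-separated) are assumptions made for illustration.

```python
# Minimal sketch: ensembling two language models from their published
# per-token probabilities by linear interpolation, tuning the weight on dev
# and evaluating on test. File names and format are hypothetical.
import numpy as np

def load_probs(path):
    """Load per-token probabilities published alongside a paper."""
    return np.loadtxt(path)

def perplexity(probs):
    """Perplexity computed from per-token probabilities of the reference text."""
    return float(np.exp(-np.mean(np.log(probs))))

def best_weight(dev_a, dev_b, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the interpolation weight that minimizes dev-set perplexity."""
    return min(grid, key=lambda w: perplexity(w * dev_a + (1 - w) * dev_b))

# Hypothetical file names; each file holds P(token_i | context) for a shared split.
dev_a, dev_b = load_probs("baseline.dev.probs"), load_probs("new_model.dev.probs")
test_a, test_b = load_probs("baseline.test.probs"), load_probs("new_model.test.probs")

w = best_weight(dev_a, dev_b)
print(f"weight={w:.2f}  ensemble test ppl={perplexity(w * test_a + (1 - w) * test_b):.2f}")
```

If the ensemble perplexity is no better than the baseline's own, the new model has learned little that the baseline does not already capture; if it improves, the model is complementary even when it is not state of the art on its own.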