
Advancing State of the Art in Language Modeling (2312.03735v1)

Published 28 Nov 2023 in cs.CL and cs.AI

Abstract: Generalization is arguably the most important goal of statistical language modeling research. Publicly available benchmarks and papers published with open-source code have been critical to advancing the field. However, it is often very difficult, and sometimes impossible, to fully reproduce the results reported in publications. In this paper, we propose a simple framework that should help advance the state of the art in language modeling in terms of generalization. We propose that future publications release not just the code, but also the model's probabilities on the dev and test sets, so that a new model can easily be added to an ensemble. This has crucial advantages: it becomes much easier to determine whether a newly proposed model is actually complementary to the current baseline. Therefore, instead of inventing new names for old tricks, the scientific community can advance faster. Finally, this approach promotes diversity of ideas: one does not need to build an individual model that sets a new state of the art to attract attention; it is sufficient to develop a model that learns patterns other models do not. Thus, even a suboptimal model can be found to have value. Remarkably, our approach has yielded new state-of-the-art results across various language modeling benchmarks, with improvements of up to 10%.
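
The workflow the abstract describes, releasing per-token probabilities so that a new model can be combined with an existing baseline and checked for complementarity, can be sketched in a few lines. The following is a minimal illustration rather than the authors' released code: the file names (`baseline_dev.npy`, etc.), the linear-interpolation ensemble, and the grid search over the mixture weight are assumptions made for the example.

```python
# Minimal sketch of ensembling from published per-token probabilities.
# Assumes each .npy file holds one probability per dev/test token (the
# probability each model assigned to the correct next token).
import numpy as np


def perplexity(token_probs: np.ndarray) -> float:
    """Perplexity computed from per-token probabilities of the correct token."""
    return float(np.exp(-np.mean(np.log(token_probs))))


def tune_weight(p_baseline: np.ndarray, p_new: np.ndarray, steps: int = 101) -> float:
    """Grid-search the interpolation weight that minimizes dev-set perplexity."""
    weights = np.linspace(0.0, 1.0, steps)
    ppls = [perplexity(w * p_new + (1.0 - w) * p_baseline) for w in weights]
    return float(weights[int(np.argmin(ppls))])


if __name__ == "__main__":
    # Hypothetical files with published probabilities for each model.
    dev_base, dev_new = np.load("baseline_dev.npy"), np.load("new_model_dev.npy")
    test_base, test_new = np.load("baseline_test.npy"), np.load("new_model_test.npy")

    w = tune_weight(dev_base, dev_new)
    print(f"best dev weight for new model: {w:.2f}")

    # A weight near zero means the new model adds little beyond the baseline;
    # a clearly nonzero weight indicates it captures complementary patterns.
    print(f"baseline test ppl: {perplexity(test_base):.2f}")
    print(f"ensemble test ppl: {perplexity(w * test_new + (1.0 - w) * test_base):.2f}")
```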
