
Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

(2401.17377)
Published Jan 30, 2024 in cs.CL, cs.AI, cs.IR

Abstract

Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.

Overview

  • The paper presents the largest n-gram model to date, trained on 1.4 trillion tokens with no upper context limit.

  • A new engine called infini-gram replaces precomputed n-gram count tables with suffix arrays, which need about 7 bytes of storage per token and answer count queries with millisecond-level latency.

  • The ∞-gram LM achieves 47% accuracy in next-token prediction, and when combined with neural LMs, reduces perplexity by up to 73%.

  • Applications include data curation, large corpus understanding, enhancing neural LMs, mitigating AI hallucinations, and providing neural model diagnostics.

  • The infini-gram engine is open-sourced to encourage further research in scalable n-gram models for language understanding.

Overview of Infini-gram

With the rise of neural LLMs, traditional n-gram language models (LMs) appeared to be losing relevance. This paper challenges that perception by presenting the largest n-gram model built to date, trained on 1.4 trillion tokens. The authors also introduce an ∞-gram LM with backoff, in which the context length n has no fixed upper limit. The model can therefore condition on arbitrarily long contexts, which the authors argue removes the performance ceiling imposed by the small n of existing n-gram models.
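To make the backoff idea concrete, here is a minimal sketch on a toy corpus. It assumes that "backoff" means shortening the context from the left until the remaining suffix occurs in the corpus (followed by at least one token), and then estimating next-token probabilities from the tokens that follow those occurrences. The function name `infgram_next_token_probs` and the naive linear scan are illustrative choices, not the engine's actual interface or algorithm.

```python
from collections import Counter


def infgram_next_token_probs(corpus, context):
    """Toy infinity-gram estimate with backoff (illustrative, not the real engine).

    Tries the full context first, then drops tokens from the left until the
    remaining suffix is found in `corpus` with at least one following token.
    Returns (effective_n, probs), where effective_n = len(matched suffix) + 1.
    """
    for start in range(len(context) + 1):
        suffix = context[start:]
        m = len(suffix)
        # Collect every token that follows an occurrence of `suffix`.
        followers = Counter(
            corpus[i + m]
            for i in range(len(corpus) - m)
            if corpus[i:i + m] == suffix
        )
        total = sum(followers.values())
        if total > 0:
            return m + 1, {tok: c / total for tok, c in followers.items()}
    return 0, {}  # only reached when the corpus is empty


corpus = "the cat sat on the mat the cat ran".split()
context = "a dog saw the cat".split()
effective_n, probs = infgram_next_token_probs(corpus, context)
print(effective_n, probs)
```

In this toy run the longest suffix of the context found in the corpus is "the cat", so the estimate uses an effective n of 3. A context whose last token never appears in the corpus backs off all the way to unigram frequencies.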

Technical Innovation

Crucially, the researchers introduce an engine named infini-gram that sidesteps the traditional, resource-intensive precomputed n-gram count tables. Infini-gram is built on suffix arrays, a data structure that supports fast n-gram counting at query time. The index needs just 7 bytes of storage per token, and building it for trillion-token-scale data takes less than three days on a single 80-core CPU node. Queries are answered with millisecond-level latency.
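The counting idea behind a suffix array can be sketched in a few lines. The example below is a simplified illustration, not the infini-gram implementation: it builds the array naively by sorting all suffixes (fine only for toy inputs), then counts an n-gram with two binary searches, since all suffixes that begin with the query form one contiguous block in sorted order. The helper names `build_suffix_array`, `_bound`, and `count_ngram` are hypothetical.

```python
def build_suffix_array(tokens):
    """Start positions of all suffixes of `tokens`, in lexicographic order.

    Naive construction for illustration; real systems use far more
    efficient algorithms and on-disk storage.
    """
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, strict):
    """Binary search over the suffix array.

    Returns the first index whose suffix, truncated to len(pattern) tokens,
    is >= pattern (or > pattern when `strict` is True).
    """
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (strict and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_ngram(tokens, sa, pattern):
    """Number of occurrences of `pattern` (a list of tokens) in `tokens`."""
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


tokens = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, ["the", "cat"]))  # prints 2
```

Each query costs a logarithmic number of comparisons in the corpus size regardless of n, which is why no per-n count table is needed. As a rough back-of-the-envelope figure, 7 bytes per token over 1.4 trillion tokens comes to about 9.8 TB of index.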

Empirical Findings

Empirical analyses underscore the potential of the ∞-gram LM to capture richer textual context and to complement neural LMs. For instance, the authors report an overall accuracy of 47% in next-token prediction on human-written text, with accuracy increasing as the length of the context used grows. More strikingly, combining the ∞-gram LM with large neural models reduces language modeling perplexity by up to 73%.
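One simple way to combine two language models is linear interpolation of their next-token probabilities. The sketch below assumes that scheme and uses made-up per-token probabilities purely to show the mechanics; the weight `lam` and all numbers are hypothetical and are not results from the paper.

```python
import math


def interpolate(p_neural, p_infgram, lam):
    """Linear interpolation: lam * P_infgram + (1 - lam) * P_neural."""
    return lam * p_infgram + (1.0 - lam) * p_neural


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the true tokens."""
    avg_neg_log = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log)


# Hypothetical probabilities assigned to the correct next token at 3 positions.
neural = [0.2, 0.1, 0.5]
infgram = [0.9, 0.8, 0.0]  # the count-based estimate can be exactly zero
lam = 0.5                  # hypothetical mixing weight

combined = [interpolate(pn, pi, lam) for pn, pi in zip(neural, infgram)]
print(perplexity(neural), perplexity(combined))
```

The third position shows why a count-based estimate is mixed with a neural one rather than used alone here: a probability of zero on the correct token would make perplexity infinite, while the interpolated value stays positive.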

Applications and Implications

The authors envision diverse applications for infini-gram, including data curation, understanding large text corpora, and enhancing neural LMs. They also discuss how the approach could help mitigate hallucination in AI-generated content by checking text against what actually appears in the corpus. In addition, the paper explores using infini-gram to produce more accurate, contextually relevant responses and to diagnose how training data influences neural models.

Conclusion

In conclusion, this research breathes new life into n-gram models, demonstrating their enduring value in the era of neural LMs. The innovation in scalable ∞-gram LMs, backed by the efficient infini-gram engine, offers powerful tools for textual analysis and significantly advances the performance of existing LMs. The authors have open-sourced the engine to spur further research, setting the stage for transformative applications in data-driven language understanding.

