Are $n$-gram language models still relevant in this era of neural LLMs? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the machine--$\infty$-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers.
The paper presents the largest n-gram model to date, trained on 5 trillion tokens, with no fixed upper limit on the context length n.
A new engine called infini-gram replaces precomputed n-gram count tables with suffix arrays, avoiding the cost of building those tables while still counting n-grams rapidly.
The ∞-gram LM achieves 47% accuracy in next-token prediction, and when combined with neural LMs, reduces perplexity by up to 73%.
Applications include data curation, large corpus understanding, enhancing neural LMs, mitigating AI hallucinations, and providing neural model diagnostics.
The infini-gram engine is open-sourced to encourage further research in scalable n-gram models for language understanding.
With the rise of neural LLMs, traditional n-gram language models (LMs) appeared to be losing relevance. This paper challenges that perception by presenting the largest n-gram model ever built, trained on 5 trillion tokens, the same data scale as neural LLMs. The authors also introduce an ∞-gram LM, which places no fixed upper limit on n: for each prediction it backs off to the longest suffix of the context that appears in the training data. This lets the model use arbitrarily long contexts, which improves its predictive performance over n-gram LMs with small n.
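The backoff idea can be illustrated with a minimal sketch. This is a toy brute-force version, not the authors' implementation: the function name `infgram_next_token_probs` is hypothetical, the corpus is a short Python list, and the sketch assumes that a suffix only counts as "found" if at least one of its occurrences is followed by a token.

```python
from collections import Counter

def infgram_next_token_probs(corpus, context):
    """Toy ∞-gram estimate (hypothetical helper, brute-force counting).

    Tries the full context first, then progressively shorter suffixes,
    and stops at the longest one that occurs in `corpus` with a following
    token. Returns (matched_suffix_length, {next_token: probability}).
    """
    for start in range(len(context) + 1):
        suffix = context[start:]
        k = len(suffix)
        # Collect the token that follows each occurrence of `suffix`.
        followers = Counter(
            corpus[i + k]
            for i in range(len(corpus) - k)
            if corpus[i:i + k] == suffix
        )
        if followers:
            total = sum(followers.values())
            return k, {tok: c / total for tok, c in followers.items()}
    # Only reached when the corpus is empty.
    return 0, {}

corpus = "a b c a b d a b c".split()
k, probs = infgram_next_token_probs(corpus, ["x", "a", "b"])
print(k, probs)
```

Here the full context `x a b` never occurs, so the estimate backs off to `a b`, which occurs three times and is followed by `c` twice and `d` once. The linear scan is only for illustration; it would be far too slow at the corpus sizes the paper targets.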
Crucially, the researchers introduce an engine named infini-gram that sidesteps the traditional, resource-intensive n-gram count tables. Infini-gram is built on suffix arrays, a data structure that allows n-grams to be counted rapidly at inference time. The index needs just 7 bytes of storage per token. Building it for trillion-token-scale data takes less than three days on a single 80-core CPU node, and queries are answered with millisecond-level latency.
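To show why a suffix array makes counting fast, here is a small sketch under simplifying assumptions: the corpus is an in-memory list of tokens, the suffix array is built by naive sorting, and the helper names `build_suffix_array` and `count_ngram` are illustrative rather than part of the infini-gram engine. Because the suffix array lists all suffix start positions in sorted order, every occurrence of a query n-gram sits in one contiguous range, and two binary searches find its boundaries.

```python
def build_suffix_array(tokens):
    # Naive construction by sorting suffixes; acceptable only for toy inputs.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, query):
    """Count occurrences of `query` using two binary searches over `sa`."""
    m = len(query)

    def first_index(strictly_greater):
        # Smallest position in `sa` whose suffix, truncated to m tokens,
        # is >= query (or > query when strictly_greater is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = tokens[sa[mid]:sa[mid] + m]
            if prefix < query or (strictly_greater and prefix == query):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(True) - first_index(False)

tokens = "a b c a b d a b c".split()
sa = build_suffix_array(tokens)
print(count_ngram(tokens, sa, ["a", "b"]))       # 3
print(count_ngram(tokens, sa, ["a", "b", "c"]))  # 2
```

Each query costs a number of comparisons logarithmic in the corpus length, regardless of n, which is why no table of precomputed counts is needed.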
Empirical analyses underscore the potential of the ∞-gram LM to capture richer textual context and to complement neural LMs. For instance, the authors report 47% accuracy in next-token prediction on human-written text, and the accuracy rises as the matched context gets longer. More strikingly, combining the ∞-gram model with large neural models reduces language modeling perplexity by up to 73%.
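The summary does not say how the two models are combined. One common way to combine language models is linear interpolation of their next-token probabilities, and the sketch below assumes that approach purely for illustration. The weight `lam` and all probabilities are made-up toy values, not results from the paper.

```python
import math

def interpolate(p_neural, p_infgram, lam):
    # Linear mixture of two probabilities for the same token.
    # `lam` is an assumed mixing weight in [0, 1].
    return lam * p_infgram + (1.0 - lam) * p_neural

def perplexity(token_probs):
    # exp of the average negative log-probability of the observed tokens.
    return math.exp(-sum(math.log(p) for p in token_probs) / len(token_probs))

# Toy probabilities each model assigns to the true next token at 3 positions.
neural = [0.20, 0.50, 0.10]
infgram = [0.90, 0.40, 0.80]
mixed = [interpolate(pn, pi, lam=0.5) for pn, pi in zip(neural, infgram)]

print(perplexity(neural), perplexity(mixed))
```

In this toy case the mixture has lower perplexity because the ∞-gram estimate is confident and correct at positions where the neural model is not. Whether that holds in practice depends on the data and on how the weight is chosen.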
The authors envision diverse applications for infini-gram, including data curation, understanding large text corpora, and enhancing neural LMs. They also discuss how the approach could help mitigate hallucination in AI-generated content. Moreover, the paper explores using infini-gram to generate accurate, contextually relevant responses and to diagnose how training data influences neural models.
In conclusion, this research breathes new life into n-gram models, demonstrating their enduring value in the era of neural LMs. The innovation in scalable ∞-gram LMs, backed by the efficient infini-gram engine, offers powerful tools for textual analysis and significantly advances the performance of existing LMs. The authors have open-sourced the engine to spur further research, setting the stage for transformative applications in data-driven language understanding.