Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens (2401.17377v4)
Abstract: Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their value in both text analysis and in improving neural LLMs. We do this by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs: 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$, which hinders their performance; we instead allow $n$ to be arbitrarily large by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram, powered by suffix arrays, that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and the infini-gram engine enable many novel and interesting analyses of human-written and machine-generated text: we find that the $\infty$-gram LM has fairly high accuracy for next-token prediction (47%) and can complement neural LLMs to greatly reduce their perplexity. When analyzing machine-generated text, we also observe irregularities in the agreement level between machine text and the $\infty$-gram with respect to suffix length, which indicates deficiencies in neural LLM pretraining and in the positional embeddings of Transformers.
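To make the suffix-array idea concrete, below is a minimal Python sketch of the $\infty$-gram estimate described in the abstract: count occurrences of a query by binary search over a suffix array, and back off to the longest suffix of the prompt that still occurs in the corpus. The function names, the toy corpus, and the naive quadratic suffix-array construction are illustrative assumptions made here, not the authors' infini-gram engine, which builds suffix arrays over trillion-token corpora and serves queries with millisecond-level latency.

```python
# Toy sketch of the infinity-gram estimate (illustrative only, not the paper's engine).
from bisect import bisect_left, bisect_right

def build_suffix_array(tokens):
    # Naive O(n^2 log n) construction for illustration; a real engine would use
    # a linear-time algorithm over the full tokenized corpus.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_occurrences(tokens, sa, query):
    # All suffixes that start with `query` form one contiguous block in `sa`,
    # so two binary searches over the truncated suffixes give the count.
    keys = [tokens[i:i + len(query)] for i in sa]
    return bisect_right(keys, query) - bisect_left(keys, query)

def infgram_next_token_probs(tokens, sa, prompt, vocab):
    # Backoff: use the longest suffix of the prompt with a nonzero corpus count.
    for start in range(len(prompt)):
        suffix = prompt[start:]
        denom = count_occurrences(tokens, sa, suffix)
        if denom > 0:
            return {w: count_occurrences(tokens, sa, suffix + [w]) / denom
                    for w in vocab}
    # No suffix found: fall back to unigram relative frequencies.
    return {w: tokens.count(w) / len(tokens) for w in vocab}

corpus = "the cat sat on the mat and the cat sat on the hat".split()
sa = build_suffix_array(corpus)
probs = infgram_next_token_probs(corpus, sa, "on the".split(), sorted(set(corpus)))
print({w: p for w, p in probs.items() if p > 0})  # {'hat': 0.5, 'mat': 0.5}
```

The backoff loop is what makes $n$ effectively unbounded: rather than fixing $n$ in advance, the estimate always conditions on the longest context that has support in the corpus, which is the defining property of the $\infty$-gram LM.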