Overview of Infini-gram
With the rise of neural large language models, traditional n-gram language models (LMs) appeared to be losing relevance. This paper challenges that perception by presenting the largest n-gram model built to date, trained on 1.4 trillion tokens. The authors propose an ∞-gram LM with no fixed upper bound on the order n: rather than committing to a preset context length, the model backs off to the longest suffix of the current context that appears in the training corpus. This lets the model exploit arbitrarily long contexts and substantially improves its predictive accuracy.
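To make the back-off idea concrete, here is a minimal, illustrative sketch in Python (not the paper's implementation). The function name and the toy corpus are assumptions for illustration: it estimates the next-token distribution by finding the longest suffix of the context that occurs in the corpus and counting what follows it.

```python
from collections import Counter

def infinigram_next_token_dist(corpus_tokens, context):
    """Toy ∞-gram estimate: back off to the longest suffix of `context`
    that occurs in the corpus, then return the distribution of what follows."""
    for start in range(len(context)):
        suffix = context[start:]
        n = len(suffix)
        counts = Counter()
        # Linear scan for the toy example; the real engine uses suffix arrays.
        for i in range(len(corpus_tokens) - n):
            if corpus_tokens[i:i + n] == suffix:
                counts[corpus_tokens[i + n]] += 1
        if counts:  # longest matching suffix found; stop backing off
            total = sum(counts.values())
            return {tok: c / total for tok, c in counts.items()}, n
    return {}, 0  # context shares no suffix with the corpus

corpus = "the cat sat on the mat . the cat sat on the rug .".split()
dist, effective_n = infinigram_next_token_dist(corpus, "cat sat on the".split())
print(effective_n, dist)  # uses the longest suffix that appears in the corpus
```

The real engine replaces the linear scan with suffix-array lookups, which is what makes this unbounded back-off tractable at trillion-token scale.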
Technical Innovation
Crucially, the researchers introduce a performant engine named infini-gram that sidesteps traditional, resource-intensive n-gram count tables. Infini-gram is built on suffix arrays, a data structure that supports fast n-gram counting at inference time. The index requires only 7 bytes of storage per token, and it is remarkably efficient to construct: building an index for trillion-token-scale data takes less than three days on a single 80-core CPU node, while queries are answered with sub-second latency.
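The sketch below illustrates the core suffix-array trick at toy scale, assuming an in-memory token list; the function names are hypothetical, and the real index operates over token-id byte arrays rather than Python lists. Because the suffix array is sorted, all suffixes that begin with a given n-gram occupy one contiguous block, so the n-gram's count is found with two binary searches.

```python
def build_suffix_array(tokens):
    """Naive O(N^2 log N) construction for illustration only; the real
    engine builds the index out-of-core over token-id byte arrays."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def count_ngram(tokens, sa, pattern):
    """Count occurrences of `pattern` with two binary searches over the
    suffix array: matching suffixes form one contiguous block."""
    m = len(pattern)
    prefix = lambda i: tokens[i:i + m]  # suffix truncated to pattern length

    def bound(strict):
        # First position whose truncated suffix is >= pattern (strict=False)
        # or > pattern (strict=True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            p = prefix(sa[mid])
            if p < pattern or (strict and p == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return bound(strict=True) - bound(strict=False)

corpus = "the cat sat on the mat . the cat sat on the rug .".split()
sa = build_suffix_array(corpus)
print(count_ngram(corpus, sa, ["cat", "sat"]))        # -> 2
print(count_ngram(corpus, sa, ["sat", "on", "the"]))  # -> 2
```

Each count query touches only a logarithmic number of suffix-array entries, which is why query latency stays low even when the corpus holds on the order of a trillion tokens.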
Empirical Findings
Empirical analyses underscore the potential of the ∞-gram LM to capture richer textual context and to complement neural LMs. For instance, the authors report that the ∞-gram LM reaches 47% accuracy in next-token prediction on human-written text, and that accuracy rises as the length of the matched context grows. More strikingly, interpolating ∞-gram estimates with those of large neural LMs reduces language-modeling perplexity by up to 73%, a substantial performance gain.
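The perplexity gain comes from mixing the two predictors. Below is a minimal sketch of such an interpolation; the constant weight `lam` and the `predict_dist` interface are illustrative assumptions, whereas the paper tunes how much weight the ∞-gram estimate receives.

```python
import math

def interpolate(p_neural, p_infinigram, lam=0.5):
    """Linear interpolation of two next-token distributions.
    `lam` is an illustrative constant; the paper tunes the mixing weights."""
    vocab = set(p_neural) | set(p_infinigram)
    return {t: (1 - lam) * p_neural.get(t, 0.0) + lam * p_infinigram.get(t, 0.0)
            for t in vocab}

def perplexity(gold_tokens, predict_dist):
    """Perplexity of a token sequence under a next-token predictor
    `predict_dist(prefix) -> {token: prob}` (hypothetical interface)."""
    nll = 0.0
    for i, tok in enumerate(gold_tokens):
        p = predict_dist(gold_tokens[:i]).get(tok, 1e-10)  # floor to avoid log(0)
        nll -= math.log(p)
    return math.exp(nll / len(gold_tokens))
```

Lower perplexity on held-out text means the interpolated predictor assigns higher probability to the tokens people actually wrote, which is the headline improvement the authors report.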
Applications and Implications
The authors envision diverse applications for infini-gram, including data curation, analysis of large text corpora, and enhancement of neural LMs. They also discuss how the approach can mitigate hallucination in AI-generated content by grounding predictions in verbatim evidence from the training data. Moreover, the paper explores using infini-gram to produce accurate, contextually relevant responses and to diagnose how a neural model's training data influences its outputs.
Conclusion
In conclusion, this research breathes new life into n-gram models, demonstrating their enduring value in the era of neural LMs. The innovation in scalable ∞-gram LMs, backed by the efficient infini-gram engine, offers powerful tools for textual analysis and significantly advances the performance of existing LMs. The authors have open-sourced the engine to spur further research, setting the stage for transformative applications in data-driven language understanding.