
Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG (2412.06078v1)

Published 8 Dec 2024 in cs.IR and cs.LG

Abstract: Recent advances have extended the context window of frontier LLMs dramatically, from a few thousand tokens up to millions, enabling entire books and codebases to fit into context. However, the compute costs of inferencing long-context LLMs are massive and often prohibitive in practice. RAG offers an efficient and effective alternative: retrieve and process only the subset of the context most important for the current task. Although promising, recent work applying RAG to long-context tasks has two core limitations: 1) there has been little focus on making the RAG pipeline compute efficient, and 2) such works only test on simple QA tasks, and their performance on more challenging tasks is unclear. To address this, we develop an algorithm based on PageRank, a graph-based retrieval algorithm, which we call mixture-of-PageRanks (MixPR). MixPR uses a mixture of PageRank-based graph-retrieval algorithms implemented using sparse matrices for efficient, cheap retrieval that can deal with a variety of complex tasks. Our MixPR retriever achieves state-of-the-art results across a wide range of long-context benchmark tasks, outperforming existing RAG methods, specialized retrieval architectures, and long-context LLMs despite being far more compute efficient. Due to using sparse embeddings, our retriever is extremely compute efficient, capable of embedding and retrieving millions of tokens within a few seconds, and runs entirely on CPU.

Summary

  • The paper introduces MixPR, a novel retrieval method that dynamically adjusts the Personalized PageRank teleport parameter to weight query relevance against structural importance.
  • The paper demonstrates that sparse embeddings enable rapid processing of millions of tokens on CPUs, significantly reducing computational expenses.
  • The paper shows MixPR’s empirical success across benchmarks with models like GPT-4o-mini and Llama3.1-8B, achieving near state-of-the-art performance.

An Evaluation of Mixture-of-PageRanks: Enhancing Long-Context Processing with Graph-Based Retrieval

The rapid expansion of the context window for LLMs to accommodate millions of tokens marks a significant advancement in the field. However, the increasing computational expenses associated with processing long contexts pose practical challenges. The paper "Mixture-of-PageRanks: Replacing Long-Context with Real-Time, Sparse GraphRAG" presents a novel retrieval approach aimed at optimizing Retrieval-Augmented Generation (RAG) methods while addressing these computational costs.

The authors introduce a new algorithm, MixPR, which utilizes a modified version of Personalized PageRank (PPR) for graph-based retrieval. The algorithm dynamically adjusts the influence of the task-specific query to improve retrieval across a variety of long-context tasks, including reasoning, summarization, and question answering. MixPR is shown to outperform both traditional RAG methods and state-of-the-art (SOTA) long-context models, while significantly reducing the computational burden through sparse matrix embeddings that run efficiently on CPUs.

Key Contributions and Methodology

  1. Novel Algorithmic Approach: The paper contributes a new retrieval algorithm, MixPR, which extends PageRank with a task-dependent hyper-parameter. Through this adaptive mechanism, MixPR can toggle between query-relatedness and structural importance, better addressing the complexity of multi-step retrieval tasks (see the sketch after this list).
  2. Efficient Sparse Representations: Employing sparse embeddings facilitates rapid context processing, making the retrieval system capable of dealing with millions of tokens in seconds using CPU resources. This contrasts with dense embedding frameworks typically limited by GPU performance and shows a significant leap in computational efficiency.
  3. Empirical Validation: MixPR was tested extensively with widely used LLMs, such as GPT-4o-mini and Llama3.1-8B, across several long-context benchmarks. It consistently achieved or approached SOTA performance on three of the four benchmarks, with particular success on the BABILong and RULER datasets.
  4. Reduced Computational Costs: The implementation runs in real time while retaining the robust performance of extended-context LLMs.
  5. Task Classification and Scalability: The paper introduces a query classifier, implemented via zero-shot prompting, that differentiates global from local retrieval tasks and classifies queries accurately across the tested tasks. This routing is critical given the varied demands that input structure and task requirements place on retrieval.
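
To make the mixture concrete, below is a minimal sketch of PPR retrieval with a task-dependent teleport weight over sparse matrices. It is not the authors' implementation: the chunk graph `A`, the similarity vector, the specific alpha values, and the "local"/"global" labels assumed to come from the zero-shot classifier are all illustrative assumptions.

```python
import numpy as np

def personalized_pagerank(A, p, alpha, iters=50):
    """Power iteration for Personalized PageRank.

    A:     (n, n) column-stochastic scipy.sparse transition matrix over chunks.
    p:     (n,) teleport/personalization vector (query-chunk similarities).
    alpha: teleport weight; a high alpha emphasizes query-relatedness,
           a low alpha emphasizes global structural importance.
    """
    n = A.shape[0]
    p = p / p.sum() if p.sum() > 0 else np.full(n, 1.0 / n)
    r = p.copy()
    for _ in range(iters):
        r = alpha * p + (1 - alpha) * (A @ r)
    return r

def mixpr_retrieve(A, query_sims, task_type, k=8):
    # Hypothetical routing: a zero-shot LLM classifier labels the query
    # "local" (e.g., needle-in-a-haystack QA) or "global" (e.g., summarization).
    alpha = 0.7 if task_type == "local" else 0.1  # illustrative values only
    scores = personalized_pagerank(A, query_sims, alpha)
    return np.argsort(-scores)[:k]  # indices of the top-k chunks to put in context
```

Because every step here is a sparse matrix-vector product or an elementwise update, the whole retrieval loop runs comfortably on CPU.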

Numerical and Computational Insights

MixPR's performance improvements are accompanied by large reductions in compute time compared to established dense embedders. When benchmarked, MixPR embedded over a million tokens in seconds, far outpacing dense models running on GPUs, while retaining strong downstream quality on tasks scored with metrics such as ROUGE and BERT F1.
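
As a rough illustration of why sparse embeddings are so cheap, the sketch below builds TF-IDF chunk embeddings and scores a query with a single sparse matrix product, entirely on CPU. The TF-IDF scheme, chunk size, query string, and file name are assumptions for illustration; the paper's exact sparse embedder may differ.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

def chunk(text, size=256):
    # Hypothetical fixed-size word chunking of a long document.
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

with open("long_document.txt") as f:  # assumed input file
    chunks = chunk(f.read())

# Sparse TF-IDF embeddings: construction and scoring are CPU-only
# sparse-matrix operations, so very long inputs embed in seconds.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(chunks)  # (n_chunks, vocab_size) CSR matrix

q = vectorizer.transform(["What does the paper claim about compute cost?"])
query_sims = (X @ q.T).toarray().ravel()  # cosine scores; rows are L2-normalized
```

A vector like `query_sims` could then serve as the personalization vector `p` in the PPR sketch above.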

Theoretical and Practical Implications

The implicit Bayesian inference MixPR performs, balancing a query-dependent likelihood against a data-independent structural prior, sets a precedent for task adaptability. The methodology marks an evolution in RAG paradigms in which sparse, graph-based retrieval complements LLM reasoning without inflating computational cost.
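
One way to make this balance explicit is through the PPR fixed point (notation ours, matching the sketch above): the ranking r mixes a query-dependent teleport distribution p_q, which plays the role of a likelihood, with the chunk-graph transition matrix A, which acts as a data-independent structural prior.

```latex
r = \alpha\, p_q + (1 - \alpha)\, A^{\top} r
\quad\Longrightarrow\quad
r = \alpha \left( I - (1 - \alpha)\, A^{\top} \right)^{-1} p_q
```

As alpha approaches 1 the ranking reduces to pure query similarity, and as alpha approaches 0 it approaches standard PageRank's structural importance, which is exactly the axis the task classifier moves along.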

Practically, this advances real-time LLM applications over extensive contexts on conventional CPUs, merging accessibility with high performance. Future work may further optimize the MixPR approach, integrate it with emerging LLM architectures, or enrich the graph structure with additional semantic layers to improve retrieval dynamics further.

The paper reflects a sophisticated understanding of retrieval complexities in long-context tasks, providing a rigorous, pragmatic framework for efficient LLM deployment moving forward.
