- The paper introduces EnronQA, a large-scale benchmark enabling personalized RAG over private documents.
- It details a multi-stage QA generation pipeline using LLMs to create over 528K high-quality question-answer pairs from 103K emails.
- Benchmarking reveals that retrieval quality strongly influences RAG performance, with BM25 unexpectedly excelling in this domain.
The paper "EnronQA: Towards Personalized RAG over Private Documents" (2505.00263) introduces a new benchmark dataset designed to address the limitations of existing Retrieval Augmented Generation (RAG) benchmarks, particularly their reliance on public data that LLMs may have already memorized, and their lack of focus on private or personalized data settings.
The core contribution is the release of the EnronQA benchmark, built upon a cleaned version of the Enron emails corpus. This dataset comprises 103,638 emails sourced from 150 distinct user inboxes, paired with an extensive collection of 528,304 question-answer pairs. The scale and segmentation of the data across different users are highlighted as key features enabling research into personalized RAG over private documents.
The construction of the EnronQA dataset involved a multi-stage pipeline:
- Corpus Filtering: Data filtering techniques similar to those used when pretraining on large text corpora were applied: MinHash and subset deduplication to remove duplicates and redundant email threads, quality filters based on document length and character ratios, language identification to keep only English content, and NSFW/toxicity filters for privacy and professionalism. The filtering substantially reduced the initial corpus while retaining a large collection of emails, and documents from the ConcurrentQA benchmark's Enron subset were re-integrated to ensure compatibility (a minimal sketch of this style of filtering appears after this list).
- QA Generation Pipeline: A rigorous, multi-stage compound LLM system implemented in DSPy was used to generate high-quality questions. The pipeline involved:
  - Initial Generation: An LLM (Llama 3.1 70B) generates an initial question from an email and the questions already produced for that email.
  - Evaluation: Each generated question is assessed against four criteria: Specificity (can an LLM identify the correct email among similar ones?), Objectivity (do different LLMs produce the same answer?), Groundedness (is the email required, i.e., do LLMs fail to answer without the email context?), and Quality (scored by an LLM judge aligned with human judgment).
  - Feedback Generation: Specific feedback is generated based on which evaluation step failed.
  - Refinement: If a question fails evaluation, an LLM with optimized prompts refines it using the feedback (e.g., using retrieved emails for specificity feedback), and the question is re-evaluated.
This loop ensures the final QA pairs are specific, objective, grounded, and high quality (a simplified DSPy-style sketch of the loop follows this list).
- Additional Data Processing: Rephrased versions of the questions were generated using an LLM-based pipeline to support different downstream tasks, such as training models for memorization.
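The paper does not reproduce its filtering code, so the following is only a minimal sketch of the style of pipeline described: MinHash-based near-duplicate removal plus simple length, character-ratio, and language heuristics. The `datasketch` and `langdetect` libraries, the thresholds, and the helper names are all illustrative assumptions, and the NSFW/toxicity filter is omitted.

```python
# Illustrative corpus-filtering sketch (not the authors' released code).
# Assumes `emails` is a list of raw email body strings; NSFW filtering omitted.
from datasketch import MinHash, MinHashLSH
from langdetect import detect

NUM_PERM = 128                      # MinHash permutations (assumed)
DUP_THRESHOLD = 0.8                 # Jaccard threshold for near-duplicates (assumed)
MIN_CHARS, MAX_CHARS = 200, 20_000  # length-based quality filter (assumed)

def minhash_of(text: str) -> MinHash:
    m = MinHash(num_perm=NUM_PERM)
    for token in set(text.lower().split()):
        m.update(token.encode("utf-8"))
    return m

def passes_quality(text: str) -> bool:
    if not (MIN_CHARS <= len(text) <= MAX_CHARS):
        return False
    alpha_ratio = sum(c.isalpha() for c in text) / max(len(text), 1)
    if alpha_ratio < 0.6:            # character-ratio heuristic (assumed)
        return False
    try:
        return detect(text) == "en"  # keep English content only
    except Exception:                # langdetect raises on empty/ambiguous text
        return False

def filter_corpus(emails: list[str]) -> list[str]:
    lsh = MinHashLSH(threshold=DUP_THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for i, email in enumerate(emails):
        if not passes_quality(email):
            continue
        mh = minhash_of(email)
        if lsh.query(mh):            # near-duplicate of an email already kept
            continue
        lsh.insert(f"email-{i}", mh)
        kept.append(email)
    return kept
```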
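The QA generation pipeline is implemented as a compound DSPy program; its exact signatures and optimized prompts are not shown in this summary. The fragment below is a simplified sketch of the generate, evaluate, refine loop using DSPy's public API. The signature fields, the `max_rounds` cap, and the `evaluate_question` stub (standing in for the four checks above) are assumptions made for illustration.

```python
# Simplified DSPy-style sketch of the generate -> evaluate -> refine loop.
# Illustrative only: field names, max_rounds, and evaluate_question are assumptions.
import dspy

# An LM must be configured before running, e.g.:
# dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class GenerateQA(dspy.Signature):
    """Write a specific, objective question answerable only from this email."""
    email = dspy.InputField()
    prior_questions = dspy.InputField(desc="questions already generated for this email")
    question = dspy.OutputField()
    answer = dspy.OutputField()

class RefineQA(dspy.Signature):
    """Revise a question that failed an evaluation check, using the feedback."""
    email = dspy.InputField()
    question = dspy.InputField()
    feedback = dspy.InputField(desc="which check failed and why")
    revised_question = dspy.OutputField()
    revised_answer = dspy.OutputField()

def evaluate_question(email: str, question: str, answer: str) -> tuple[bool, str]:
    """Stub for the four checks described above (specificity, objectivity,
    groundedness, LLM-judged quality); returns (passed, feedback)."""
    return True, ""

class QAGenerator(dspy.Module):
    def __init__(self, max_rounds: int = 3):
        super().__init__()
        self.generate = dspy.ChainOfThought(GenerateQA)
        self.refine = dspy.ChainOfThought(RefineQA)
        self.max_rounds = max_rounds

    def forward(self, email: str, prior_questions: list[str]):
        qa = self.generate(email=email, prior_questions="\n".join(prior_questions))
        question, answer = qa.question, qa.answer
        for _ in range(self.max_rounds):
            passed, feedback = evaluate_question(email, question, answer)
            if passed:
                break
            fixed = self.refine(email=email, question=question, feedback=feedback)
            question, answer = fixed.revised_question, fixed.revised_answer
        return question, answer
```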
The paper demonstrates the utility of EnronQA through benchmark calibration, baseline RAG benchmarking, and a case study on memorized knowledge:
- Benchmark Calibration: It shows that, unlike on prominent public benchmarks such as NaturalQuestions and TriviaQA, LLMs score very low on EnronQA without retrieval. This indicates that the knowledge in EnronQA is largely not pre-memorized by these models, making it a better benchmark for measuring the true impact of retriever quality on RAG performance. Accuracy on EnronQA scales linearly with retrieval recall, offering clearer insight into pipeline improvements.
- Benchmarking RAG Pipelines: The paper provides baseline numbers for combinations of retrievers (BM25, ColBERTv2), LLMs (Llama 3.1 8B/70B, GPT-4o), and RAG architectures (with and without query rewriting) on EnronQA. Surprisingly, simple lexical search with BM25 performs strongly, likely because the generated questions are highly specific to individual emails, and query rewriting did not consistently improve performance on this benchmark (a minimal BM25-based RAG sketch appears after this list).
- Case Study: Memorized Knowledge: The paper explores training LLMs to memorize factual knowledge as an alternative to traditional RAG. Fine-tuning LoRA adapters on QA pairs from EnronQA achieves performance comparable to placing all facts in the prompt ("Long Context") and scales beyond context-window limits. While traditional RAG still outperforms memorization in this simplified setting, the results suggest that continued pretraining and fine-tuning for memorization are promising directions for future research, with EnronQA serving as a realistic testbed (a hedged LoRA configuration sketch appears after this list).
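The baselines combine off-the-shelf retrievers with instruction-tuned LLMs; the exact prompts are in the paper. Below is a minimal sketch of the kind of BM25-plus-LLM pipeline being benchmarked, using the `rank_bm25` package; the `llm` callable, prompt template, and `k=5` cutoff are illustrative assumptions.

```python
# Minimal BM25 + LLM RAG sketch (illustrative, not the paper's exact pipeline).
from rank_bm25 import BM25Okapi

def build_bm25_index(emails: list[str]) -> BM25Okapi:
    # rank_bm25 expects a pre-tokenized corpus; simple whitespace tokenization here.
    return BM25Okapi([e.lower().split() for e in emails])

def retrieve(bm25: BM25Okapi, emails: list[str], question: str, k: int = 5) -> list[str]:
    # Rank a single user's inbox by lexical overlap with the question.
    return bm25.get_top_n(question.lower().split(), emails, n=k)

def rag_answer(bm25: BM25Okapi, emails: list[str], question: str, llm) -> str:
    # `llm` is any callable wrapping Llama 3.1, GPT-4o, etc. (assumed interface).
    context = "\n\n---\n\n".join(retrieve(bm25, emails, question))
    prompt = (
        "Answer the question using only the emails below.\n\n"
        f"Emails:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
    return llm(prompt)
```

Because EnronQA questions are tied to specific individual emails, even this lexical baseline retrieves the right document surprisingly often, consistent with the benchmarking result above.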
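The memorization case study fine-tunes LoRA adapters on EnronQA question-answer pairs. The configuration below is a hedged sketch using the Hugging Face `peft` library; the base model name, adapter rank, and target modules are assumptions rather than the paper's reported hyperparameters.

```python
# Hedged LoRA fine-tuning sketch for fact memorization (assumed hyperparameters).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # assumed base model

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Each EnronQA pair would be rendered as "Question: ... Answer: ..." and trained
# with a standard causal-LM objective (e.g., transformers.Trainer); omitted here.
```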
The authors highlight that self-verifying, self-optimizing LLM pipelines are powerful tools for generating high-quality synthetic data, as demonstrated by their QA generation process. They also emphasize the potential of memorization methods (such as fine-tuning or continued pretraining) to complement, or eventually replace, traditional retrieval.
An Ethics Statement is included, acknowledging the sensitive nature of the Enron email data. The authors state they used a cleaned version of the corpus, applied content filters, and are committed to addressing any data removal requests from affected parties.
In conclusion, EnronQA is presented as a valuable new resource for the community, enabling better benchmarking of RAG pipelines, exploring personalized and private retrieval settings, and facilitating research into LLM memorization and continued pretraining for information retrieval tasks over non-public data.