Analysis of WixQA: A Benchmark for Enterprise Retrieval-Augmented Generation
The paper "WixQA: A Multi-Dataset Benchmark for Enterprise Retrieval-Augmented Generation" addresses a notable gap in the current landscape of question answering (QA) systems, particularly within enterprise environments. Traditional QA benchmarks rely largely on open-domain datasets, which do not adequately reflect the specific requirements and challenges of enterprise settings, such as precise retrieval from domain-specific knowledge bases and support for complex, multi-step procedural responses. This research introduces WixQA, an innovative benchmark specifically designed to fill this void.
Key Components of WixQA
WixQA comprises three distinct datasets:
- WixQA-ExpertWritten: 200 authentic queries from Wix.com users, each paired with a detailed answer authored by a domain expert. Answers often synthesize information from several knowledge-base articles, reflecting real-world enterprise support scenarios.
- WixQA-Simulated: 200 QA pairs distilled from user-chatbot dialogues and validated by domain experts. It targets concise procedural guidance, showing how information spread across multi-turn conversations can be condensed into a precise, single-turn answer.
- WixQA-Synthetic: 6,221 QA pairs generated from Wix knowledge-base articles with an LLM, providing large-scale training data for retrieval and generation models; a possible record layout for these splits is sketched below.
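As a concrete illustration (not the released schema), the following Python sketch shows one plausible way to represent and load records from these splits. The field names, file names, and JSONL layout are assumptions made for illustration only.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List

@dataclass
class WixQARecord:
    """One question-answer pair; field names are illustrative assumptions."""
    question: str            # the user query
    answer: str              # expert-written or distilled answer
    article_ids: List[str]   # KB articles the answer is grounded in

def load_split(path: str) -> List[WixQARecord]:
    """Load a dataset split from a JSONL file (hypothetical layout)."""
    records = []
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            records.append(WixQARecord(
                question=row["question"],
                answer=row["answer"],
                article_ids=row.get("article_ids", []),
            ))
    return records

# Example usage with assumed file names:
# expert = load_split("wixqa_expertwritten.jsonl")   # 200 records
# simulated = load_split("wixqa_simulated.jsonl")    # 200 records
# synthetic = load_split("wixqa_synthetic.jsonl")    # 6,221 records
```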
All three datasets are grounded in a curated snapshot of the Wix knowledge base containing 6,221 articles. This snapshot serves as the corpus for both retrieval and augmentation, so that generated answers can be evaluated against verified, domain-specific content rather than open-web text.
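To make the retrieval side tangible, here is a minimal sketch of sparse BM25 indexing over such a knowledge-base snapshot using the rank_bm25 package. The article fields and whitespace tokenization are simplifying assumptions and do not reflect the paper's actual pipeline.

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

# Assumed article structure: a list of dicts with "id", "title", "body".
articles = [
    {"id": "kb-001", "title": "Connecting a custom domain",
     "body": "To connect a domain, open Settings and choose Domains ..."},
    {"id": "kb-002", "title": "Setting up site payments",
     "body": "Payments can be enabled from the Dashboard under Accept Payments ..."},
]

# Whitespace tokenization keeps the sketch simple; a real setup would use a
# proper tokenizer and likely chunk long articles into passages first.
corpus_tokens = [(a["title"] + " " + a["body"]).lower().split() for a in articles]
bm25 = BM25Okapi(corpus_tokens)

def retrieve(query: str, k: int = 3):
    """Return the top-k articles ranked by BM25 score."""
    scores = bm25.get_scores(query.lower().split())
    ranked = sorted(zip(articles, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

for article, score in retrieve("how do I connect my domain"):
    print(f"{article['id']}  score={score:.2f}  {article['title']}")
```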
Experimental Setup and Results
The experiments are run with FlashRAG and compare two retrieval methods, BM25 (sparse) and E5 (dense), paired with several state-of-the-art generation models such as Claude 3.7 and GPT-4o. Each configuration is assessed across multiple metrics, including F1, BLEU, and context recall, for a comprehensive evaluation.
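For the dense side, the sketch below approximates E5-style retrieval using the sentence-transformers library and the public intfloat/e5-base-v2 checkpoint; the exact E5 variant and the FlashRAG indexing pipeline used in the paper may differ.

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# E5 models expect "query: " / "passage: " prefixes; the specific E5
# checkpoint below is an assumption, not necessarily the paper's choice.
model = SentenceTransformer("intfloat/e5-base-v2")

passages = [
    "To connect a domain, open Settings and choose Domains ...",
    "Payments can be enabled from the Dashboard under Accept Payments ...",
]
passage_emb = model.encode(
    ["passage: " + p for p in passages], normalize_embeddings=True
)

def dense_retrieve(query: str, k: int = 2):
    """Rank passages by cosine similarity to the query embedding."""
    query_emb = model.encode("query: " + query, normalize_embeddings=True)
    scores = util.cos_sim(query_emb, passage_emb)[0]
    best = scores.argsort(descending=True)[:k]
    return [(passages[i], float(scores[i])) for i in best]

for passage, score in dense_retrieve("how do I connect my domain"):
    print(f"{score:.3f}  {passage[:60]}")
```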
The results show that dense retrieval handles complex queries better, particularly on the ExpertWritten and Simulated datasets, with markedly higher context recall. Generation models varied across metrics, pointing to trade-offs between factual alignment and n-gram similarity. The modest baseline scores underscore how challenging enterprise-centric RAG remains and leave clear room for further research.
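To make one of the reported metrics concrete, the following sketch computes a SQuAD-style token-level F1 between a generated answer and a reference answer. The paper's exact normalization rules and evaluation harness are not reproduced here; this is a common approximation.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """SQuAD-style token-level F1 (a common approximation; the paper's
    exact normalization may differ)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("open settings and choose domains",
               "open settings then choose domains"))  # ~0.8
```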
Implications and Future Directions
WixQA has implications for both research and practice. Its datasets vary in scale and complexity, providing a basis for QA systems that handle procedural, domain-specific questions and reflecting the field's shift toward enterprise adaptability.
Future work could expand the datasets, add multi-hop retrieval challenges, and strengthen human evaluation protocols. By releasing WixQA and its associated resources, the authors deliver a reusable benchmarking tool and set a precedent for evaluating QA systems in enterprise applications.
In conclusion, WixQA is a meaningful step toward QA systems that meet enterprise-level demands, underscoring the value of domain-grounded data and well-balanced retrieval and generation in RAG development.