
Generating realistic domain‑relevant QA data for RAG systems

Develop methods to generate realistic, domain‑relevant question–answer pairs to support testing and validation of Retrieval‑Augmented Generation (RAG) systems that index unstructured documents in application‑specific domains.


Background

The paper notes that RAG systems often lack application‑specific ground truth because they operate over unstructured documents, making traditional test datasets unavailable or insufficient. This creates a need for synthetic or automatically generated questions and answers tailored to the specific domain and document collection.

While there is emerging work using LLMs to generate questions from multiple documents, the authors emphasize that producing realistic, domain‑relevant question–answer pairs remains unresolved, hindering systematic testing and monitoring of RAG systems.
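One common approach in this emerging line of work is to prompt an LLM, passage by passage, to produce a grounded question–answer pair. The sketch below illustrates that pattern under stated assumptions: the `llm` callable, the prompt wording, and the JSON reply format are all hypothetical placeholders, not a method from the paper.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt template; double braces escape the literal JSON example.
PROMPT = (
    "You are creating test data for a RAG system.\n"
    "Given the passage below, write one realistic question a domain user "
    "might ask, and the answer grounded in the passage.\n"
    'Respond as JSON: {{"question": "...", "answer": "..."}}\n\n'
    "Passage:\n{passage}"
)


def generate_qa_pairs(
    passages: List[str],
    llm: Callable[[str], str],  # assumed interface: prompt in, text out
) -> List[Dict[str, str]]:
    """Ask the LLM for one QA pair per passage; skip unparseable replies."""
    pairs: List[Dict[str, str]] = []
    for passage in passages:
        reply = llm(PROMPT.format(passage=passage))
        try:
            record = json.loads(reply)
        except json.JSONDecodeError:
            continue  # drop malformed generations rather than fail the run
        if "question" in record and "answer" in record:
            record["source_passage"] = passage  # keep provenance for grading
            pairs.append(record)
    return pairs
```

Keeping the source passage alongside each pair lets a later step check that the generated answer is actually supported by the indexed document, which is one way to filter out unrealistic or ungrounded pairs.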

References

How to generate realistic domain relevant questions and answers remains an open problem.

Seven Failure Points When Engineering a Retrieval Augmented Generation System (2401.05856 - Barnett et al., 11 Jan 2024) in Section 6.3 (Testing and Monitoring RAG systems)