Daily-Omni QA Generation Pipeline
- Daily-Omni QA Generation Pipeline is a modular system that generates, filters, and optimizes QA pairs across diverse modalities and continually updated data sources.
- It employs a multi-stage architecture with BM25 retrieval, neural ranking, and neural RM3 query expansion, followed by a BERT-based reader to ensure high recall and efficiency.
- Empirical results on Wikipedia/SQuAD benchmarks (EM=58.1, F1=65.8) demonstrate its superior performance and adaptability for real-time and high-throughput applications.
A Daily-Omni QA Generation Pipeline is an end-to-end, modular system that generates, filters, and optimizes question–answer (QA) pairs—potentially across modalities—on a continual or high-frequency basis. Integrating advances from open-domain QA, information retrieval, neural ranking, query expansion, machine reading comprehension, and adaptive timing and supervision, the pipeline supports robust real-time or dynamic question–answering from large-scale unstructured corpora such as Wikipedia or daily-updated web content.
1. Multi-Stage Pipeline Architecture
A hallmark of the Daily-Omni QA Generation Pipeline is its stratified structure, which explicitly separates retrieval, ranking, query expansion, and reading comprehension into distinct modules:
- Retriever: The process begins with lexical retrieval, typically using a fast inverted index such as BM25 implemented by Anserini (Lucene 8.0), working over paragraphs or passages with stop words removed. The retriever outputs the top candidate documents per query, providing high recall with low latency.
- Neural Ranker: The top-retrieved documents are reranked by a neural ranker, usually BERT-Base fine-tuned on MS MARCO and SQuAD for binary answerability. Paragraphs are truncated (e.g., to the first 448 tokens) for efficiency.
- Neural RM3 (Query Expansion): Instead of classic RM3 term reweighting, the neural RM3 variant computes an expanded query vector by interpolating the original query's TF-IDF vector with the TF-IDF vectors of paragraphs positively scored by the ranker (the update formula is given in Sections 3 and 7). This expansion boosts recall by surfacing context absent from BM25 retrieval alone.
- Reader: A span extractor (e.g., BERT-Base or BERT-Large) predicts answer start/end tokens over a 384-token input. Only top-ranked passages (typically about 2.5% of all retrieved candidates) are processed at this stage, drastically reducing model inference costs.
Innovations include separating ranking from reading to allow heavier, high-capacity readers later in the pipeline and leveraging neural feedback (via neural RM3) to improve the “retrieval + reader” paradigm beyond classical techniques like DrQA or BERTserini.
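The end-to-end flow can be summarized in a short orchestration sketch. The `retriever`, `ranker`, `expander`, and `reader` objects below are hypothetical stand-ins with assumed interfaces; they are not APIs from Anserini, BERTserini, or the Mindstone code base, and the knob values are illustrative.

```python
def answer_question(question, retriever, ranker, expander, reader,
                    k_retrieve=100, k_read=3):
    """Minimal orchestration sketch of the four-stage pipeline."""
    # Stage 1: BM25-style lexical retrieval over paragraphs (high recall, low latency).
    passages = retriever.search(question, k=k_retrieve)

    # Stage 2: rerank with a BERT-based binary answerability classifier.
    scored = sorted(((ranker.score(question, p), p) for p in passages),
                    key=lambda sp: sp[0], reverse=True)

    # Stage 3: neural RM3 -- build an expanded query from positively scored
    # passages, retrieve again with it, and rerank the new candidates.
    positive = [(s, p) for s, p in scored if s > 0]
    expanded = expander.expand(question, positive)
    passages = retriever.search(expanded, k=k_retrieve)
    scored = sorted(((ranker.score(question, p), p) for p in passages),
                    key=lambda sp: sp[0], reverse=True)

    # Stage 4: run the heavier span-extraction reader only on the top few
    # passages (roughly the top 2.5% in the reference configuration).
    answers = [reader.extract(question, p) for _, p in scored[:k_read]]
    return max(answers, key=lambda a: a["confidence"])
```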
2. Performance Metrics and Empirical Results
The pipeline is evaluated using established metrics, with specific attention to open-domain QA needs:
- Exact Match (EM): Percentage of predicted answers exactly matching gold annotations.
- F1 Score: Token-level overlap between prediction and reference answers.
- Recall: Measured at the retrieval and ranking stages (e.g., recall@100), denoting the likelihood that the answer appears among the top candidates.
On the Wikipedia/SQuAD benchmark, the pipeline attains EM = 58.1 and F1 = 65.8, exceeding previous methods (e.g., BERTserini) by approximately 8 points and achieving a lower end-to-end latency (738 ms per query versus prior 887–988 ms). This improvement is attributed primarily to the more efficient and accurate filtering in the ranker and expanded recall from neural RM3.
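For reference, a minimal implementation of the EM and token-level F1 metrics reported above might look like the following. The normalization mirrors the standard SQuAD-style evaluation (lowercasing, stripping punctuation and articles, collapsing whitespace); it is a sketch, not the exact evaluation script used in the cited work.

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))

def f1_score(prediction: str, gold: str) -> float:
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("the Eiffel Tower", "Eiffel Tower"))              # 1.0 after normalization
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 2))  # 0.67
```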
3. Advanced Information Retrieval and Neural Query Expansion
Distinctive to this pipeline is its two-layer retrieval structure:
- First-Layer Retrieval: BM25 retrieves diverse, high-recall passage candidates using unigram matching.
- Neural RM3 Feedback: Instead of classic RM3, the pipeline forms an expanded query using term vectors from passages deemed answerable by the neural ranker. The update is

  $$\vec{q}^{\,\prime} = \alpha\,\vec{q} + (1-\alpha)\sum_{i} s_i\,\vec{d}_i,$$

  where $\vec{q}$ is the original query's TF-IDF vector, $\vec{d}_i$ the term vector for document $i$, and $s_i$ its (unnormalized) neural ranker score.
This step yields a 6-point increase in recall@100, indicating superior context coverage in difficult retrieval scenarios.
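Assuming the query and passages are already represented as TF-IDF vectors over a shared vocabulary, the neural RM3 update can be sketched in a few lines of NumPy. The positive-score threshold, the toy vocabulary, and the variable names are illustrative assumptions.

```python
import numpy as np

def neural_rm3_expand(query_vec: np.ndarray,
                      passage_vecs: np.ndarray,   # shape: (n_passages, vocab_size)
                      ranker_scores: np.ndarray,  # shape: (n_passages,)
                      alpha: float = 0.5) -> np.ndarray:
    """Interpolate the original query with ranker-weighted passage term vectors."""
    # Keep only passages the neural ranker scored as answerable (score > 0).
    keep = ranker_scores > 0
    feedback = passage_vecs[keep].T @ ranker_scores[keep]  # weighted term sum
    return alpha * query_vec + (1 - alpha) * feedback

# Toy usage over a 5-term vocabulary.
q = np.array([1.0, 0.0, 0.5, 0.0, 0.0])
docs = np.array([[0.2, 0.0, 0.4, 0.0, 0.9],
                 [0.0, 0.7, 0.0, 0.1, 0.0]])
scores = np.array([1.3, -0.4])  # unnormalized ranker logits
print(neural_rm3_expand(q, docs, scores, alpha=0.6))
```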
4. Machine Reading Comprehension Integration
The final reading stage is carefully optimized:
- Model: BERT encoder with an additional linear classifier for span extraction within a 384-token context.
- Selectivity: Only highly ranked passages after reranking/expansion are passed to the reader, allowing the use of heavier models without prohibitive latency.
- Robust Design: The strict separation of ranking from reading enables more flexible deployment strategies (such as choosing BERT-Base for quick settings or BERT-Large for accuracy-dominated regimes).
This design ensures high EM/F1, while reducing computational cost and answering latency.
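As an illustrative stand-in for this reading stage, the Hugging Face transformers question-answering pipeline can extract answer spans from the few top-ranked passages. The checkpoint shown is a public SQuAD-fine-tuned BERT and may differ from the exact reader used in the original system.

```python
from transformers import pipeline

# Span-extraction reader; swap in a lighter checkpoint for latency-sensitive settings.
reader = pipeline("question-answering",
                  model="bert-large-uncased-whole-word-masking-finetuned-squad")

def read_top_passages(question, passages):
    """Run the reader only on the top-ranked passages and keep the best span."""
    answers = [reader(question=question, context=p) for p in passages]
    return max(answers, key=lambda a: a["score"])

best = read_top_passages(
    "Who designed the Eiffel Tower?",
    ["The Eiffel Tower was designed by the engineer Gustave Eiffel's company.",
     "The tower is 330 metres tall and is located in Paris."])
print(best["answer"], best["score"])
```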
5. Use of Low-Resolution Labels
A notable property is the incorporation of “low-resolution labels”—paragraph-level answer presence signals rather than token-level locations:
- Source: Obtained from large-scale datasets like MS MARCO, often based on user clicks or coarse relevance judgments, which are cheaper to annotate.
- Supervision: The neural ranker is trained on these labels, which allows for scale and generalization beyond the SQuAD token-level supervision.
- Benefit: Expands model applicability to settings where only user interaction data exists or detailed annotation is infeasible, improving scalability and model robustness.
The ability to leverage such supervision reduces cost and enables continual adaptation with user feedback in production deployments.
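A minimal sketch of training such a binary answerability ranker from paragraph-level labels is shown below. It assumes (question, paragraph, label) triples, e.g., derived from MS MARCO-style relevance judgments; the data format, hyperparameters, and single-example training step are simplifying assumptions rather than the original training recipe.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # 0 = no answer, 1 = paragraph contains answer
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(question: str, paragraph: str, contains_answer: int) -> float:
    """One gradient step from a (question, paragraph, binary label) triple."""
    enc = tokenizer(question, paragraph, truncation=True,
                    max_length=448, return_tensors="pt")
    loss = model(**enc, labels=torch.tensor([contains_answer])).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example triple with a coarse, paragraph-level label.
training_step("who designed the eiffel tower",
              "Gustave Eiffel's engineering company designed the tower.", 1)
```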
6. Adaptation to Timing and Efficiency Requirements
The pipeline is adjustable to varied application contexts:
- Retrieval Size Tuning: Lowering the number of retrieved candidates accelerates response time (as low as 110 ms per query with minimal accuracy loss), enabling real-time applications.
- Modularity: The clear division between retriever, ranker, and reader supports substitution with lighter (e.g., distilled) or heavier models per deployment scenario.
- Scenario Optimization: For throughput-critical environments, one can prioritize retriever efficiency; for maximum accuracy, allocate more budget to reader complexity.
This flexible configuration ensures the pipeline can operate in both interactive and analysis-heavy tasks with optimal tradeoffs.
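One way to expose these knobs is as named deployment presets. The structure and preset values below are illustrative assumptions, not configurations taken from the original system; only the latency figures quoted above come from the source.

```python
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    k_retrieve: int    # passages returned by BM25 (recall vs. latency)
    k_read: int        # passages passed to the span-extraction reader
    reader_model: str  # lighter or heavier reader checkpoint

# Accuracy-oriented preset: large candidate pool, heavy reader.
accuracy_preset = PipelineConfig(k_retrieve=100, k_read=3, reader_model="bert-large")

# Throughput-oriented preset: small candidate pool, light reader.
realtime_preset = PipelineConfig(k_retrieve=10, k_read=1, reader_model="bert-base")
```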
7. Mathematical Framework and Query Expansion Formula
A key theoretical underpinning is the use of neural feedback for query expansion, crystallized in the update

$$\vec{q}^{\,\prime} = \alpha\,\vec{q} + (1-\alpha)\sum_{i} s_i\,\vec{d}_i,$$

where:
- $\vec{q}$ is the TF-IDF vector of the original question,
- $\vec{d}_i$ is the TF-IDF vector for document $i$ selected by the ranker,
- $s_i$ denotes the neural ranker's score for document $i$,
- $\alpha$ is the interpolation coefficient in $[0, 1]$.
This mechanism injects semantic signals from ranking back into token-based retrieval, facilitating an adaptive, high-recall candidate set for downstream reading and answer extraction.
In summary, the Daily-Omni QA Generation Pipeline, as instantiated in the Mindstone system, sets a strong baseline for open-domain answering by modularizing retrieval and reading, exploiting neural ranking and query expansion, integrating low-resolution annotation, and offering granular control of latency–accuracy trade-offs. These design principles not only achieved state-of-the-art benchmarks on Wikipedia/SQuAD (EM = 58.1, F1 = 65.8) but also established a scalable, easily tunable architecture for practical and production QA systems (Semnani et al., 2020).