Unsupervised QA Generation

Updated 21 April 2026

Unsupervised QA Generation is a method that automatically constructs question-answer pairs from unlabeled text by selecting answer spans and synthesizing questions.
It employs techniques like cloze-style masking, neural machine translation, and LLM-driven prompts to transform raw passages into diverse and competitive QA datasets.
This approach reduces reliance on expensive human annotation and, with filtering and augmentation strategies, in some cases reaches 60–80% of supervised performance on standard benchmarks.

Unsupervised Question Answer Generation (Unsupervised QA Generation) refers to the construction of question–answer (QA) datasets entirely from unlabeled corpora, without recourse to any human-annotated QA pairs. This paradigm is motivated by the expensive and slow process of large-scale data annotation in open-domain, domain-specific, or cross-lingual QA, and it has been realized through a spectrum of algorithmic approaches spanning entity/noun-phrase extraction, cloze-style masking, neural machine translation, information extraction, semantic role heuristics, prompt-based LLMs, and unsupervised augmentation. Such methods unlock the training of competitive QA systems—both extractive and generative—using only raw text and general-purpose tools.

1. Core Principles and Motivation

Unsupervised QA Generation fundamentally decomposes QA data creation into: (1) answer selection from unlabeled text, (2) question synthesis (often via cloze→interrogative transformation), and (3) context construction and pairing, all in the absence of ground-truth QA alignments. This field is motivated by the observation that most real QA answers are not limited to named entities, that QA datasets’ answer distributions are highly heterogeneous, and that existing supervised approaches require costly annotation and struggle with domain adaptation. Empirically, unsupervised pipelines can yield strong downstream QA models that in some cases reach 60–80% of fully supervised performance on benchmarks such as SQuAD v1.1, Natural Questions, and TriviaQA using only synthetic data (Lewis et al., 2019, Li et al., 2020, Zhu et al., 2021, Nie et al., 2022).

Key methodological drivers include:

Exploiting linguistic structure (NER, constituency, OpenIE, dependency parses) to select high-utility answer spans.
Generating cloze-style (“fill-in-the-blank”) questions and translating them to natural-language interrogatives.
Leveraging neural machine translation (often unsupervised) between cloze templates and natural questions.
Applying filtering, denoising, paraphrasing, and augmentation to inject syntactic and answer-type diversity.
Bootstrapping off large, unlabeled text collections or domain knowledge sources (unstructured text, tables, KGs).

2. Foundational Pipelines and Evolution

The canonical unsupervised QA data generation pipeline originated with cloze translation approaches:

Step	Representative Methods	Key References
Answer span selection	NER/NP extraction, OpenIE	(Lewis et al., 2019, Nie et al., 2022)
Cloze question generation	Mask answer, extract clause	(Lewis et al., 2019, Li et al., 2020)
Cloze→natural question	Rule-based, UNMT, templates	(Lewis et al., 2019, Fabbri et al., 2020)
Context–question pairing	Sentence/retrieval, alignment	(Fabbri et al., 2020, Li et al., 2020)
Data augmentation/denoising	Paraphrasing, adversarial, filtering	(Nie et al., 2022, Nagumothu et al., 2023, Zhou et al., 2025)

In early work (Lewis et al., 2019), named entities or noun phrases are randomly sampled as candidate answers, the corresponding sub-clauses are extracted, and each answer is replaced by a mask slot. An unsupervised NMT model is then trained to translate cloze statements into natural questions. The resulting synthetic (context, question, answer) triples are filtered and used to fine-tune extractive QA models. This approach demonstrated that, with BERT-large, 56.4 F1 on SQuAD v1.1 is attainable with zero annotated pairs.
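
The cloze-creation step can be illustrated with a minimal sketch. It assumes the answer span has already been chosen (for example by an NER tagger), uses a naive regex sentence splitter, and masks the whole sentence rather than an extracted sub-clause. The `MASK` token and all function names are hypothetical and are not taken from Lewis et al. (2019).

```python
import re
from typing import List, NamedTuple, Optional

MASK = "[MASK]"  # hypothetical slot token; the source does not name one


class ClozeExample(NamedTuple):
    context: str
    cloze: str
    answer: str


def split_sentences(text: str) -> List[str]:
    """Naive splitter on '.', '!' or '?' followed by whitespace (an assumption)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def make_cloze(context: str, answer: str) -> Optional[ClozeExample]:
    """Mask the first occurrence of `answer` in the first sentence containing it."""
    for sentence in split_sentences(context):
        if answer in sentence:
            return ClozeExample(context, sentence.replace(answer, MASK, 1), answer)
    return None  # the answer string does not occur in the context
```

Calling `make_cloze(passage, "Warsaw")` on a passage containing "Marie Curie was born in Warsaw." would yield the cloze statement "Marie Curie was born in [MASK]."; a translation or template step would then turn that statement into a question.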

Template-based and retrieval-augmented approaches (e.g. (Fabbri et al., 2020)) generate questions using related—but not identical—sentences retrieved from the corpus, masked at the answer span, and transformed into interrogatives with type-aware templates. This increases paraphrastic diversity and contextual complexity, yielding further performance gains.
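
A type-aware template can be sketched as follows. This is a simplified illustration, not the exact template set of Fabbri et al. (2020): the wh-word table, the type labels, and the "wh-word + text after the mask + text before the mask" ordering are all assumptions made for the example.

```python
# Hypothetical mapping from answer type to wh-word; unknown types fall back to "What".
WH_BY_TYPE = {"PERSON": "Who", "LOC": "Where", "DATE": "When"}


def cloze_to_question(cloze: str, answer_type: str, mask: str = "[MASK]") -> str:
    """Turn a cloze statement into a question with a simple reordering template."""
    if mask not in cloze:
        raise ValueError("cloze statement does not contain the mask token")
    wh_word = WH_BY_TYPE.get(answer_type, "What")
    before, _, after = cloze.partition(mask)
    # Drop sentence-final punctuation so the question ends with a single '?'.
    after = after.strip().rstrip(".!?").strip()
    before = before.strip()
    parts = [wh_word] + [fragment for fragment in (after, before) if fragment]
    return " ".join(parts) + "?"
```

Template output is often ungrammatical (for example "Where Marie Curie was born in?"), which is one reason the pipelines described here apply downstream filtering or learned rewriting.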

Refinements such as paraphrasing (e.g., via PEGASUS) plus OpenIE extraction (Nagumothu et al., 2023), or leveraging citation graphs for wider lexical/syntactic diversity (Li et al., 2020), improve both QA diversity and performance, while adversarial augmentation (Nie et al., 2022) and robust denoising filters mitigate noise and overfitting.

3. Methodological Innovations

Expansions and improvements in the unsupervised QA generation paradigm include:

Extension to answer diversity: DiverseQA (Nie et al., 2022) recognizes that NE-only extraction covers only ~40–50% of gold answers. This method iteratively extends NE spans to maximal constituents (NP, VP, S, ADJP) according to syntactic constraints, enabling the generation of QA pairs with both short and long, factoid or descriptive answers. It further introduces answer-type-aware adversarial data augmentation at the embedding level and downstream confidence-based denoising.
Multi-hop and cross-modality QA generation: MQA-QG (Pan et al., 2020) constructs reasoning graphs that leverage multi-hop connections between texts or table–text pairs. It composes bridge and comparison operators over contexts, fuses single-hop QAs using BERT-based blending, and ranks with fluency-based metrics. This produces synthetic multi-hop QA datasets achieving 83% of fully supervised F1 on HotpotQA and 61% on HybridQA.
Domain-specific pipelines: PIE-QG (Nagumothu et al., 2023) integrates information extraction, paraphrasing, and triple merging to address QA generation in low-resource settings (e.g., small corpora), while D-SCoRE (Zhou et al., 2025) applies LLM-based chain-of-thought (CoT) prompting, explicit/implicit question balancing, and semantic-role augmentation to generate richly annotated, structured QA-CoT pairs for domain-specific supervised fine-tuning (SFT).
Purely prompt-based LLM data generation: Recent methods exploit pre-trained LLMs to generate (Q,A) pairs from raw passages via carefully engineered prompts, chain-of-thought strategies, and output filtering (Zhang et al., 2023, Zhou et al., 2025). These approaches provide scalability, multi-difficulty balancing, and the flexibility to target explicit/inferential and factoid/counterfactual QA properties.
Unsupervised multiple-choice QA: The construction of fully synthetic multiple-choice QA datasets incorporates answer extraction, cloze→question translation, and plausible distractor mining via knowledge graphs or NER type matching, supporting zero-annotation MCQA (Zhang et al., 2024).

4. Empirical Results and Benchmark Assessments

Unsupervised QA generation pipelines have matured to the point where they consistently approach (or even surpass, under low-resource constraints) early supervised QA systems on standard benchmarks.

Selected performance highlights across recent works:

Method / Paper	SQuAD v1.1 F1	NQ F1	TriviaQA F1	NewsQA F1	Special Note
Unsupervised NMT+Cloze (Lewis et al., 2019)	54.7	35.1	23.8	–	Named entity test F1=64.5
RefQA (+Refine) (Li et al., 2020)	71.4	–	–	45.1	Outperforms prior unsupervised methods; competitive with early supervised systems
Template-based (Fabbri et al., 2020)	64.0	–	–	–	State-of-the-art on NER-based SQuAD, F1=77.55
PIE-QG (Nagumothu et al., 2023)	72.6	–	–	–	On par with best refined pipelines, using 10x less data
DiverseQA (Nie et al., 2022)	76.9	60.8	51.3	61.4	Improves non-NE and long-answer F1 by 4–8 points
MQA-QG (Pan et al., 2020) (HotpotQA)	–	68.6	–	–	83% of supervised F1 in multi-hop regime
UODQA (Zhu et al., 2021) (TQA, EM)	–	–	50.2	–	86% of supervised retriever-reader
D-SCoRE (Zhou et al., 2025) (SQuADShifts F1)	55.2–64.7	–	–	–	Outperforms human-annotated data in downstream SFT
Self-QA (Zhang et al., 2023) (domain F1)	57.1–54.0	–	–	–	Beats Self-Instruct, strong on finance/medical/legal

Ablations commonly find that NE→constituent extension, paraphrasing, coreference resolution, and open-domain retrieval for question formation each add 1–7 F1, while denoising or answer-type-aware adversarial augmentation further improve performance, especially on non-NE and long-span answers. Notably, unsupervised QA pretraining strongly boosts few-shot performance in low-resource settings, narrowing the gap to fully supervised models when only a handful of real examples are available (Li et al., 2020, Nie et al., 2022).

5. Diversity, Answerability, and Denoising Strategies

Recent work emphasizes that naïve NE-only sampling leads to synthetic datasets with pathological answer distributions, poor alignment with real benchmarks, and poor generalization to non-NE answers (Nie et al., 2022). Extensions introduce:

Constituent-based span extension: Iteratively enlarging answer spans to match constituent structure, greatly improving coverage of NP, VP, subclause, and descriptive answers.
OpenIE and paraphrasing: Flexible triple extraction from paraphrased contexts yields more diverse, semantically distinct QA pairs (Nagumothu et al., 2023).
Type-aware question synthesis: Templates or sequence-to-sequence models select wh-words by answer span type (e.g., WHERE for LOC, WHEN for DATE).
Adversarial augmentation: Adapting the embedding space and adding robust perturbations aligned with answer type supports better generalization and robustness (Nie et al., 2022).
Denoising filters: Model confidence and answer consistency constraints prune noisy examples at training, boosting QA reliability.
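
A confidence and answer-consistency filter of the kind listed above can be sketched as follows. The `predict` callable stands in for any trained QA model that returns an answer string and a confidence score; the threshold value and the case-insensitive string comparison are assumptions, not details from the cited papers.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class QAExample:
    context: str
    question: str
    answer: str


# Hypothetical interface: (context, question) -> (predicted answer, confidence in [0, 1]).
Predictor = Callable[[str, str], Tuple[str, float]]


def denoise(examples: List[QAExample], predict: Predictor,
            min_confidence: float = 0.5) -> List[QAExample]:
    """Keep examples whose synthetic answer the model reproduces with enough confidence."""
    kept = []
    for ex in examples:
        predicted, confidence = predict(ex.context, ex.question)
        consistent = predicted.strip().lower() == ex.answer.strip().lower()
        if consistent and confidence >= min_confidence:
            kept.append(ex)
    return kept
```

Raising `min_confidence` trades dataset size for label quality, so the threshold is usually tuned against downstream validation performance.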

In multi-hop settings, information fusion and filtering (fluency, answer consistency) are crucial to constructing nontrivial multi-step questions (Pan et al., 2020).

6. Expansions: MCQA, Clinical, and LLM-Driven QA Generation

Unsupervised MCQA extends the paradigm to the construction of datasets with plausible distractors via hybrid NE-type and KG-based methods (Zhang et al., 2024). The pipeline incorporates answer extraction, cloze-to-question translation, and entity similarity scoring for distractor mining. Fine-tuning LLMs on such synthetic MCQA data achieves 77.7% accuracy on ARC.
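
Distractor mining by entity-type matching can be sketched as below. The sketch assumes a precomputed index from NER type to entity strings, and it uses character-level similarity from the standard library as a stand-in for the similarity scoring of Zhang et al. (2024), whose actual scoring function is not described here.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def mine_distractors(answer: str, answer_type: str,
                     entities_by_type: Dict[str, List[str]], k: int = 3) -> List[str]:
    """Pick up to k same-type entities, most similar to the answer first."""
    answer_key = answer.strip().lower()
    # Same-type entities, excluding the answer itself (case-insensitive).
    pool = {
        entity.strip()
        for entity in entities_by_type.get(answer_type, [])
        if entity.strip().lower() != answer_key
    }

    def similarity(candidate: str) -> float:
        return SequenceMatcher(None, answer_key, candidate.lower()).ratio()

    # Sort by descending similarity, then alphabetically for determinism.
    ranked = sorted(pool, key=lambda c: (-similarity(c), c))
    return ranked[:k]
```

Surface similarity is only a rough proxy for plausibility; a knowledge-graph or embedding-based score could be swapped in through the `similarity` function.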

Clinical QA settings employ LLMs with non-overlap or schema-based prompts to induce challenging, “non-text-matching” questions requiring clinical reasoning, with synthetic QA improving fine-tuning F1 by 6.9–10 points over baselines. Ensuring answer fidelity and diversity remains an open challenge (Bai et al., 2024).

LLM-driven approaches, typified by Self-QA (Zhang et al., 2023) and D-SCoRE (Zhou et al., 2025), synthesize diverse, domain-specific instruction–answer pairs directly via prompt engineering, chain-of-thought (CoT) reasoning, and structured JSON export, scaling to millions of QA pairs without external labeling.
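
The prompt-and-parse loop behind such pipelines can be sketched without committing to any particular model API. The prompt wording, the JSON schema, and the filtering rules below are assumptions for illustration and are not the prompts used by Self-QA or D-SCoRE; the LLM call itself is left out.

```python
import json
from typing import List, Tuple

# Hypothetical prompt; real systems use longer, carefully engineered instructions.
PROMPT_TEMPLATE = (
    "Read the passage and write {n} question-answer pairs about it.\n"
    'Return only a JSON list of objects with keys "question" and "answer".\n\n'
    "Passage:\n{passage}\n"
)


def build_prompt(passage: str, n_pairs: int = 3) -> str:
    return PROMPT_TEMPLATE.format(n=n_pairs, passage=passage.strip())


def parse_pairs(raw_output: str, passage: str,
                require_extractive: bool = False) -> List[Tuple[str, str]]:
    """Parse model output and drop malformed or (optionally) non-extractive pairs."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    pairs = []
    for item in data:
        if not isinstance(item, dict):
            continue
        question, answer = item.get("question"), item.get("answer")
        if not (isinstance(question, str) and isinstance(answer, str)):
            continue
        question, answer = question.strip(), answer.strip()
        if not question.endswith("?") or not answer:
            continue
        if require_extractive and answer not in passage:
            continue  # answer is not a literal span of the passage
        pairs.append((question, answer))
    return pairs
```

Setting `require_extractive=True` keeps only answers that appear verbatim in the passage, which suits extractive QA training; inferential or implicit questions need the flag left off and a different fidelity check.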

7. Limitations, Open Challenges, and Future Directions

Despite substantial progress, Unsupervised QA Generation faces the following challenges:

Dependency on linguistic analyzers (NER, OpenIE, coref) and pre-trained models constrains cross-lingual and low-resource adoption.
Answer span misalignment with gold data remains a major bottleneck; improved selection (OpenIE, semantic roles) is required (Lyu et al., 2021).
Fluency vs. faithfulness: Neural question generation/paraphrasing may improve syntax but can drift semantically, hurting answerability (Pan et al., 2020). Keeping the question faithful to its answer matters more than surface fluency.
Scaling to “why” and multi-span questions, as well as self-consistent unanswerable QA, is only nascent (Nikolenko et al., 2020). Most frameworks focus on extractive, factoid settings.
Quality of distractors in MCQA or counterfactual choices in generative pipelines is variable and impacts downstream performance (Zhang et al., 2024, Zhou et al., 2025).
Human-level evaluation and metric selection: While EM/F1/accuracy are standard, further assessment of diversity, informativeness, and domain transferability is desirable.
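
For reference, the EM and token-level F1 metrics mentioned above can be computed as in the sketch below. It follows the commonly used SQuAD-style answer normalization (lowercasing, removing punctuation and English articles); treating an empty prediction or reference as zero F1 is a simplification that fits SQuAD v1.1-style data without unanswerable questions.

```python
import re
import string
from collections import Counter

_PUNCTUATION = set(string.punctuation)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0  # also covers empty predictions or references
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

As a worked case, the prediction "the Nobel Prize" against the reference "Nobel Prize in Physics" shares two tokens after normalization, giving precision 1.0, recall 0.5, and F1 of 2/3, while EM is 0.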

Future directions include integration of round-trip and self-consistency checks, richer fusion of heterogeneous data (tabular, KB, text), LLM-powered iterative bootstrap with filtering, and minimal-human-in-the-loop or hybrid semi-supervised systems. Persistent improvements in span selection, noise reduction, and domain adaptation are likely to further close the gap with manual annotation-based QA.

Key References:

"Unsupervised Question Answering by Cloze Translation" (Lewis et al., 2019)
"Template-Based Question Generation from Retrieved Sentences…" (Fabbri et al., 2020)
"Harvesting and Refining Question-Answer Pairs for Unsupervised QA" (Li et al., 2020)
"Unsupervised Open-Domain Question Answering" (Zhu et al., 2021)
"Unsupervised Question Answering via Answer Diversifying" (Nie et al., 2022)
"PIE-QG: Paraphrased Information Extraction…" (Nagumothu et al., 2023)
"Self-QA: Unsupervised Knowledge Guided LLM Alignment" (Zhang et al., 2023)
"Unsupervised multiple choices question answering via universal corpus" (Zhang et al., 2024)
"D-SCoRE: Document-Centric Segmentation and CoT Reasoning…" (Zhou et al., 2025)
"Give me Some Hard Questions: Synthetic Data Generation for Clinical QA" (Bai et al., 2024)
"Improving Unsupervised Question Answering via Summarization-Informed Question Generation" (Lyu et al., 2021)
"Unsupervised Multi-hop Question Answering by Question Generation" (Pan et al., 2020)