Generative Question Answering (GQA)
- Generative Question Answering is a method that generates free-form answers by synthesizing, paraphrasing, and integrating evidence rather than merely extracting text.
- It leverages Transformer-based encoder-decoder architectures with pointer-generator, coverage, and retrieval-augmented mechanisms to enhance factual accuracy and handle open-ended queries.
- Recent advancements focus on mitigating hallucinations, improving multi-hop reasoning, and achieving domain adaptation for more robust and scalable QA systems.
Generative Question Answering (GQA) refers to the family of methods that generate natural-language answers to questions, potentially synthesizing information, paraphrasing, or combining evidence, rather than simply extracting spans or retrieving facts verbatim. GQA models leverage neural architectures—initially recurrent encoder-decoder networks and now predominantly Transformer-based sequence-to-sequence models—and often incorporate external knowledge sources, retrieval, numerical reasoning, and hybrid discriminative-generative mechanisms. The field addresses challenges well beyond the limitations of extractive QA, enabling diverse answer synthesis, multi-hop reasoning, and robust handling of open-ended or abstractive questions.
1. Foundations of Generative Question Answering
The central distinction in GQA lies in the output modality: answers are generated token by token, enabling paraphrasing, synthesis, and abstraction, instead of being restricted to selecting subsequences or categorical labels. Early models such as GENQA unified encoder-decoder networks with external knowledge-base querying, employing attention-based and retrieval-augmented components to ground answer generation in structured data (Yin et al., 2015).
Subsequent models generalized the paradigm to open-domain QA on unstructured text corpora, leveraging multi-layered attention, pointer-generator networks (enabling factual copying), and hybrid loss functions. The generation objective typically maximizes the likelihood of the gold answer sequence conditioned on the input question $q$ and supporting context $c$, i.e. it minimizes $-\sum_{t=1}^{T} \log p(a_t \mid a_{<t}, q, c)$ over the answer tokens $a_1, \ldots, a_T$. Later advances introduced copying and coverage mechanisms to improve factual consistency and reduce repetition (Mitra, 2017; Song et al., 2017).
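A minimal sketch of this objective follows, assuming a HuggingFace-style encoder-decoder (T5 is used purely for illustration; any sequence-to-sequence generator works the same way): the model is trained with teacher forcing to maximize the log-likelihood of the gold answer given the question and context.

```python
# Minimal sketch of the MLE objective for GQA, assuming a HuggingFace-style
# encoder-decoder (T5 used purely for illustration; any seq2seq model works).
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

question = "Who wrote On the Origin of Species?"
context = "On the Origin of Species was published by Charles Darwin in 1859."
answer = "Charles Darwin"

# Condition on question + context; supervise on the gold answer tokens.
inputs = tokenizer(f"question: {question} context: {context}",
                   return_tensors="pt", truncation=True)
labels = tokenizer(answer, return_tensors="pt").input_ids

# Teacher-forced forward pass: `loss` is the mean negative log-likelihood
# -(1/T) * sum_t log p(a_t | a_<t, q, c) over the answer tokens.
loss = model(**inputs, labels=labels).loss
loss.backward()
```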
2. Core Architectures and Mechanisms
Sequence-to-Sequence Backbones
Encoder–decoder Transformers (e.g., T5, BART, GPT-2, LLaMA) form the standard backbone for GQA. Architectures are frequently augmented by:
- Pointer-generator/copy networks: Allow direct reproduction of source tokens (names, numbers, entities), crucial for factual accuracy and handling out-of-vocabulary content; a minimal sketch follows this list (Mitra, 2017; Song et al., 2017; Wang et al., 2017).
- Coverage vectors: Track cumulative attention for each source position, discouraging repetition and ensuring all relevant evidence is covered (Mitra, 2017, Song et al., 2017).
- Multi-perspective matching/dual-attention: Compare question and context along various similarity axes, improving alignment and robustness (Song et al., 2017).
- Span-index or extractive guidance: Recent approaches train generative models to output answer indices (sentence/token positions) rather than spans or free text, circumventing label sparsity in extractive QA (Mallick et al., 2023).
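To make the copy mechanism concrete, the following is a minimal, self-contained sketch of the pointer-generator mixing step (shapes and names are illustrative, not tied to any particular paper's implementation): the final output distribution interpolates between generating from the vocabulary and copying source tokens via the attention weights.

```python
# Minimal pointer-generator mixing step (illustrative shapes/names only).
import torch
import torch.nn.functional as F

def pointer_generator_step(vocab_logits, attn_weights, src_token_ids, p_gen):
    """Mix generation and copying into one distribution over the vocabulary.

    vocab_logits:  [batch, vocab]    decoder scores over the output vocabulary
    attn_weights:  [batch, src_len]  attention over source positions (sums to 1)
    src_token_ids: [batch, src_len]  vocabulary ids of the source tokens
    p_gen:         [batch, 1]        probability of generating vs. copying
    """
    gen_dist = p_gen * F.softmax(vocab_logits, dim=-1)
    copy_dist = torch.zeros_like(gen_dist)
    # Scatter the (down-weighted) attention mass onto the source-token ids,
    # so tokens present in the context can be copied verbatim.
    copy_dist.scatter_add_(1, src_token_ids, (1.0 - p_gen) * attn_weights)
    return gen_dist + copy_dist  # a valid probability distribution

# Toy usage: the mixed distribution still sums to one per example.
batch, src_len, vocab = 2, 6, 50
mixed = pointer_generator_step(
    torch.randn(batch, vocab),
    F.softmax(torch.randn(batch, src_len), dim=-1),
    torch.randint(0, vocab, (batch, src_len)),
    torch.sigmoid(torch.randn(batch, 1)),
)
assert torch.allclose(mixed.sum(dim=-1), torch.ones(batch))
```

A coverage vector would additionally accumulate `attn_weights` across decoding steps and penalize re-attending to already-covered source positions, which is how repetition is discouraged.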
Integration with Knowledge Retrieval
GQA systems differ in how they incorporate external knowledge:
- Memory-augmented architectures: Explicitly encode and attend over key–value memory stores (KBs, tabular data, or passage-level knowledge) (Yin et al., 2015, Tu et al., 2018).
- Retrieval-augmented generation: Combine dense retrievers or hybrid generator-retriever stacks to access relevant context at inference time (a simplified sketch follows this list). For example, the Generator-Retriever-Generator (GRG) pipeline leverages both synthetic and retrieved contexts to maximize knowledge coverage (Abdallah et al., 2023), while models like R-GQA retrieve in-context demonstrations to steer generation (Du et al., 2022).
- Unified generative retrieval + QA: Joint optimization of retrieval (docid or passage generation) and answer generation using a shared encoder, often bolstered by LLM-generated connectors/adapters (Li et al., 2023).
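The sketch below illustrates the retrieval-augmented pattern in its simplest form, under stated assumptions: a TF-IDF retriever stands in for a dense retriever, and the prompt format is hypothetical. Top-scoring passages are retrieved for the question and concatenated into the generator's input.

```python
# Simplified retrieval-augmented GQA: retrieve top-k passages for a question,
# then build the generator input. TF-IDF stands in for a dense retriever here.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Charles Darwin published On the Origin of Species in 1859.",
    "The Eiffel Tower was completed in 1889 in Paris.",
    "Darwin proposed the theory of evolution by natural selection.",
]
question = "Who proposed the theory of natural selection?"

vectorizer = TfidfVectorizer().fit(corpus)
scores = cosine_similarity(vectorizer.transform([question]),
                           vectorizer.transform(corpus))[0]
top_k = scores.argsort()[::-1][:2]                  # indices of best passages
retrieved = " ".join(corpus[i] for i in top_k)

# Hypothetical prompt format for a seq2seq or decoder-only generator.
prompt = f"question: {question} context: {retrieved}"
print(prompt)  # fed to the generator at inference time
```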
3. Training Objectives and Losses
GQA models employ a combination of sequence likelihood, auxiliary, and hybrid objectives:
- Maximum likelihood estimation: Core objective over gold answer sequences, tokenized as $a = (a_1, \ldots, a_T)$ and scored by $-\sum_{t=1}^{T} \log p(a_t \mid a_{<t}, q, c)$.
- Reinforcement learning (policy gradient): ROUGE- or task-metric-oriented reinforcement for exposure bias reduction and direct optimization of downstream metrics (Song et al., 2017).
- Multi-task/joint objectives: Simultaneous training on QA, question generation, evidence generation, and restoration tasks, as in joint QA/QG frameworks (Wang et al., 2017). In advanced models (e.g., EATQA), cross-losses (KL-divergence) distill evidence-aware answering into standard QA heads to mitigate hallucinations (Du et al., 27 Aug 2024).
- Mixed generative + extractive losses: Span-prediction terms encourage the model to align generated outputs with explicit answer spans in the context (Xu et al., 2021).
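As a schematic illustration of such a mixed objective, the sketch below assumes the model exposes both a sequence decoder and a span head over the encoder states (the head, weighting, and tensor names are illustrative): the total loss simply sums the generation negative log-likelihood with start/end span cross-entropy terms.

```python
# Schematic mixed generative + extractive objective (module names illustrative).
import torch
import torch.nn.functional as F

def mixed_loss(gen_logits, answer_ids, start_logits, end_logits,
               gold_start, gold_end, span_weight=0.5):
    """gen_logits:  [T, vocab]  decoder logits at the gold answer positions
       answer_ids:  [T]         gold answer token ids (teacher-forcing targets)
       start/end_logits: [ctx_len]  span scores over context token positions
       gold_start, gold_end:    [1]  gold span boundary indices in the context
    """
    generation_nll = F.cross_entropy(gen_logits, answer_ids)
    span_nll = (F.cross_entropy(start_logits.unsqueeze(0), gold_start) +
                F.cross_entropy(end_logits.unsqueeze(0), gold_end)) / 2
    return generation_nll + span_weight * span_nll

# Toy call with random tensors, just to show the shapes involved.
T, vocab, ctx_len = 4, 100, 30
loss = mixed_loss(torch.randn(T, vocab), torch.randint(0, vocab, (T,)),
                  torch.randn(ctx_len), torch.randn(ctx_len),
                  torch.tensor([5]), torch.tensor([8]))
print(float(loss))
```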
4. Evaluation Protocols and Benchmarks
GQA systems are evaluated on a combination of free-form generation and extractive metrics, including:
- BLEU-n, ROUGE-L: Match generated answers/questions to gold references by n-gram overlap; these metrics correlate only weakly with answerability and human judgment (Klein et al., 2019).
- Exact Match (EM), token-level F1: For extractive and answer-span tasks, the generated answer must exactly match a gold answer (EM) or overlap with it at the token level (F1); see the sketch after this list (Wang et al., 2017; Xu et al., 2021).
- Human evaluation: Manual ratings of fluency, factuality, informativeness (especially relevant in settings with substantial abstraction or open-ended answer spaces) (Tu et al., 2018).
- Diversity, informativeness: Number of unique n-grams, proportion of distinct answers or phrase types (Tu et al., 2018).
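For reference, the standard SQuAD-style EM and token-level F1 computations can be sketched as follows, using the common normalization convention of lowercasing and stripping punctuation and articles:

```python
# Standard SQuAD-style answer normalization, Exact Match, and token-level F1.
import re
import string
from collections import Counter

def normalize(text):
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)   # drop articles
    return " ".join(text.split())                 # collapse whitespace

def exact_match(prediction, gold):
    return float(normalize(prediction) == normalize(gold))

def token_f1(prediction, gold):
    pred_toks = normalize(prediction).split()
    gold_toks = normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Charles Darwin", "charles darwin"))                         # 1.0
print(round(token_f1("the naturalist Charles Darwin", "Charles Darwin"), 2))   # 0.8
```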
Representative datasets include SQuAD v1.1/v2, MS MARCO, MultiSpanQA, ACE-05, BioASQ, TriviaQA, NaturalQuestions, QASPER, and multi-modal datasets for spoken QA (e.g., NMSQA).
5. Advanced Directions: Hallucination Mitigation, Multi-hop Reasoning, and Domain Adaptation
Hallucination Mitigation
Generative models risk producing "hallucinated" answers unsupported by the context. Recent frameworks (e.g., EATQA) formalize hallucination reduction via evidence-enhanced triplet generation—requiring the model to cyclically generate answers, supporting evidence, and reconstruct the question, alongside distillation losses that bridge the evidence-conditional and evidence-absent answer distributions (Du et al., 27 Aug 2024). Empirically, such triplet reasoning increases EM/F1 and reduces hallucination rates over standard LLM fine-tuning.
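One schematic way to write such a bridging term (not the paper's exact formulation; the two distributions, temperature, and names are illustrative) is a KL divergence that pulls the evidence-absent answer distribution toward the evidence-conditioned one:

```python
# Schematic KL "bridging" term between evidence-conditioned and evidence-absent
# answer distributions (illustrative; not EATQA's exact formulation).
import torch
import torch.nn.functional as F

def evidence_distill_loss(logits_with_evidence, logits_without_evidence, tau=1.0):
    """Both tensors: [T, vocab] per-token answer logits under the two inputs."""
    teacher = F.softmax(logits_with_evidence.detach() / tau, dim=-1)
    student = F.log_softmax(logits_without_evidence / tau, dim=-1)
    # KL(teacher || student), averaged over answer positions; gradients flow
    # only into the evidence-absent branch because the teacher is detached.
    return F.kl_div(student, teacher, reduction="batchmean") * tau * tau

# Toy usage with random logits.
T, vocab = 5, 200
loss = evidence_distill_loss(torch.randn(T, vocab),
                             torch.randn(T, vocab, requires_grad=True))
loss.backward()
```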
Multi-hop and Complex Reasoning
GQA systems exhibit marked shortcomings in zero-shot multi-hop compositionality when trained on single-hop data (Jiang et al., 2022). Augmenting training via concatenation of single-hop decompositions or logical-form supervision (SPARQL) boosts multi-hop EM performance by 7–20 points but does not fully close the gap to models trained directly on multi-hop data. Multi-hop faithfulness thus requires explicit supervision or structured modeling.
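A toy illustration of the decomposition-concatenation augmentation described above (the format is hypothetical; real pipelines derive the hops from annotations or SPARQL logical forms): two single-hop QA pairs are chained into one multi-hop training example so the intermediate reasoning step becomes explicit supervision.

```python
# Toy construction of a multi-hop training example from single-hop decompositions
# (hypothetical format; real pipelines derive hops from annotations or SPARQL).
hop1 = {"q": "Who wrote On the Origin of Species?", "a": "Charles Darwin"}
hop2 = {"q": "Where was Charles Darwin born?", "a": "Shrewsbury"}

multi_hop_question = "Where was the author of On the Origin of Species born?"

# Concatenate the single-hop chain as the supervision target, so the model
# sees the intermediate hop rather than only the final answer.
source = f"question: {multi_hop_question}"
target = f"{hop1['q']} {hop1['a']} {hop2['q']} {hop2['a']} answer: {hop2['a']}"
print(source)
print(target)
```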
Numerical and Non-Textual Answering
Specialized GQA models address open-vocabulary outputs not limited to lexicalized text, such as numbers unseen in training. Hybrid word–character schemes, memory attention over key–value floats, and retrieved answer priors enable fluent yet precise numerical generation, as exemplified in StockQA (Tu et al., 2018).
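A generic sketch of memory attention over key-value numeric entries is shown below (illustrative only; StockQA's actual architecture differs in its details): the question or decoder state attends over learned key vectors, and the attended value, a raw float, can then be emitted or conditioned on during generation.

```python
# Generic key-value memory attention over numeric values (illustrative sketch;
# not StockQA's exact architecture). Keys are learned vectors, values are floats.
import torch
import torch.nn.functional as F

def attend_numeric_memory(query, keys, values):
    """query:  [d]      question/decoder state
       keys:   [n, d]   one embedding per memory slot (e.g., 'open price')
       values: [n]      the raw numeric value stored in each slot
    """
    scores = keys @ query                      # [n] dot-product relevance
    weights = F.softmax(scores, dim=-1)        # attention over memory slots
    return (weights * values).sum(), weights   # attended number + alignment

d, n = 16, 4
query = torch.randn(d)
keys = torch.randn(n, d)
values = torch.tensor([52.3, 51.8, 53.1, 50.9])   # e.g., recent prices
number, weights = attend_numeric_memory(query, keys, values)
print(float(number), weights.tolist())
```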
In the speech/textless domain, GSQA demonstrates that an end-to-end sequence-to-sequence model operating on discrete acoustic units can transfer abstractive skills acquired from text QA data to speech QA—achieving robust extractive and zero-shot abstractive performance, with resilience to ASR errors (Shih et al., 2023).
6. Semi-Supervised, Few-Shot, and Unified Paradigms
Recent GQA approaches address label scarcity and domain transfer by integrating semi-supervised pipelines, leveraging unlabeled passages for joint question and answer generation via collaborative feedback (Klein et al., 2019). Retrieval-augmented GQA with similarity gating and few-shot sampling (e.g., JointEnc cluster-based selection) outperforms uniform or active learning in low-data regimes, maintaining domain and event-type coverage (Du et al., 2022). Unified generative retrieval–QA frameworks (e.g., UniGen) break the pipeline modularity by co-optimizing document retrieval and answer generation with LLM-generated connectors, systematically improving retrieval recall and open-book QA metrics (Li et al., 2023).
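A minimal sketch of cluster-based few-shot selection in the spirit of the JointEnc strategy (the embedding source and budget are placeholders): candidate examples are embedded, clustered, and the example nearest each centroid is chosen, which preserves domain and event-type coverage better than uniform sampling.

```python
# Cluster-based few-shot example selection (illustrative; embeddings and budget
# are placeholders, in the spirit of JointEnc-style coverage sampling).
import numpy as np
from sklearn.cluster import KMeans

def select_few_shot(embeddings, k):
    """Pick k examples: the one nearest each K-means centroid."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        selected.append(int(members[dists.argmin()]))
    return selected

# Toy usage with random "question+answer" embeddings.
rng = np.random.default_rng(0)
emb = rng.normal(size=(100, 32))
print(select_few_shot(emb, k=8))   # indices of the 8 selected training examples
```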
7. Open Problems, Limitations, and Future Directions
GQA remains challenged by:
- Hallucination control: Despite explicit triplet modeling and regularization, faithfulness is still imperfect, especially as context length increases and evidence chains become less explicit (Du et al., 27 Aug 2024).
- Multi-hop compositionality: Models do not naturally compose single-hop skills into robust multi-hop reasoning and require structural or logical-form intervention (Jiang et al., 2022).
- Abstractive domain transfer: While transfer from text to speech or low-resource domains is promising, domain mismatch and lack of high-quality unlabeled inference data limit ultimate generalizability (Shih et al., 2023).
- Retrieval–generation integration: Unified generative pipelines outperform modular ones, but failure to explicitly aggregate evidence across multiple documents or compose answer rationales limits performance on list or multi-answer queries (Li et al., 2023).
- Label sparsity under complex supervision: Generative index-generation for extractive QA addresses sparsity but may need hierarchical or rationale-aware design for scaling to long contexts (Mallick et al., 2023).
Future directions include richer iterative connector/generator frameworks, reinforcement-driven evidence selection, scaled joint architectures for multi-hop and open-domain QA, advanced number/symbol modeling, and further extension to multilingual, multi-modal, and resource-constrained settings.