Document-Grounded Question Generation
- Document-grounded question generation is the automatic creation of questions explicitly linked to textual document content, ensuring relevance and answerability.
- Core methodologies involve encoding document segments, selecting question types based on content features, and decoding questions anchored to specific passages.
- This technique significantly benefits data augmentation for QA systems, improves reading comprehension tools, and enhances interactive conversational agents by generating diverse, content-rich questions.
Document-grounded question generation refers to the automated creation of natural language questions that are explicitly conditioned on the content, structure, or semantics of a textual document. This paradigm is motivated by the need to produce questions that are answerable, contextually relevant, and diverse, supporting downstream applications in reading comprehension, conversational systems, educational assessment, and data augmentation for question answering (QA) models. Document-grounded question generation differs from sentence-level or open-ended question generation by requiring models to identify salient information and generate inquiries that are explicitly anchored to specific document spans, entities, or discourse phenomena.
1. Core Architectures and Methodological Principles
Document-grounded question generation frameworks are typically constructed as probabilistic, modular systems that encode the source document, determine appropriate question types and targets, and decode questions in natural language. The foundational architecture proposed in "Automatic Generation of Grounded Visual Questions" (Zhang et al.) employs a tripartite structure:
- Input Segmentation and Representation: Documents are partitioned into contextually meaningful segments (e.g., image regions with dense captions, as in vision, or text passages/sentences in pure NLP). Each segment is embedded using context-sensitive encoders (LSTM, VGG-16 for vision; Transformer-based models for text).
- Question Type Selection: For each segment, a parameterized model (e.g., LSTM followed by softmax) predicts the probability distribution over possible question types (who, what, where, when, why, how), conditioning on content features.
- Joint Decoding of Questions: A neural decoder, initialized with representations of the selected segment and question type, generates the question. The decoder is augmented with an n-gram (bigram) language model, interpolated with the neural predictions to enhance grammaticality and reduce repetitive outputs.
The overall joint probability of generating a question $q$ of type $t$ grounded in segment $s$ of document $D$ is factorized as

$$P(q, t, s \mid D) = P(s \mid D)\, P(t \mid s, D)\, P(q \mid t, s, D).$$

This structured approach enables independent modeling and subsequent fusion of content grounding, question-type selection, and syntactic realization.
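The factorization can be made concrete with a minimal sketch: pick a segment, sample a question type for it, and hand both to a decoder. The module names, dimensions, and the uniform segment prior below are illustrative assumptions, not details from the paper.

```python
# Sketch of the factorized pipeline
#   P(q, t, s | D) = P(s | D) * P(t | s) * P(q | t, s)
# TypeSelector and sample_question_plan are hypothetical names.
import torch
import torch.nn as nn
import torch.nn.functional as F

QUESTION_TYPES = ["who", "what", "where", "when", "why", "how"]

class TypeSelector(nn.Module):
    """Maps a segment representation to a distribution over question types."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, len(QUESTION_TYPES))

    def forward(self, segment_repr: torch.Tensor) -> torch.Tensor:
        return F.softmax(self.proj(segment_repr), dim=-1)  # P(t | s)

def sample_question_plan(segment_reprs: torch.Tensor, selector: TypeSelector):
    """Sample a (segment, question type) pair; a decoder would then
    realize the question conditioned on both."""
    s_idx = torch.randint(len(segment_reprs), (1,)).item()  # P(s | D): toy uniform prior
    type_probs = selector(segment_reprs[s_idx])
    t_idx = torch.multinomial(type_probs, 1).item()
    return s_idx, QUESTION_TYPES[t_idx]

# Toy usage: four document segments encoded as 128-d vectors.
segments = torch.randn(4, 128)
selector = TypeSelector(hidden_dim=128)
s, t = sample_question_plan(segments, selector)
print(f"decode a '{t}' question grounded in segment {s}")
```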
2. Content Grounding and Segment-Level Alignment
A defining feature is the explicit anchoring of generated questions to concrete document regions or passages. In the case of visual grounding, captions localized to image regions are fused with image embeddings in a "correlation module" to obtain a joint semantic representation. For text documents, this principle is directly transferable: passages or sentences are segmented and scored for informativeness or saliency (analogous to region proposals in visual grounding).
The model achieves grounding by:
- Encoding each region or passage alongside local (visual or textual) features.
- Sampling question types per segment based on context-appropriate priors.
- Conditioning decoding on the fused segment–question type representation, ensuring generated questions are faithful to the underlying content.
For text documents, grounding may further leverage extractive preselection (via summarization attention or saliency scoring) and utilize contextual embeddings (e.g., BERT, RoBERTa) for richer representation of passages.
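One way to realize such extractive preselection is to score passages by embedding centrality, i.e., similarity to the document centroid. The sketch below assumes this centroid heuristic and the `sentence-transformers` model name as illustrative choices; neither is prescribed by the paper.

```python
# Sketch: rank passages by saliency via cosine similarity to the
# document centroid in sentence-embedding space.
import numpy as np
from sentence_transformers import SentenceTransformer

def rank_passages(passages: list[str]) -> list[tuple[float, str]]:
    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # one common choice
    embs = encoder.encode(passages, normalize_embeddings=True)
    centroid = embs.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    scores = embs @ centroid  # cosine similarity, since rows are unit-norm
    order = np.argsort(-scores)
    return [(float(scores[i]), passages[i]) for i in order]

passages = [
    "The treaty was signed in 1648, ending the Thirty Years' War.",
    "See appendix B for formatting guidelines.",
    "Negotiations took place in the cities of Münster and Osnabrück.",
]
for score, passage in rank_passages(passages):
    print(f"{score:.3f}  {passage}")
```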
3. Question Type Sampling and Diversity
Semantic diversity and appropriateness of question types are maintained through a learned sampler that maps a segment representation $h_s$ to a multinomial distribution over question types,

$$P(t \mid s) = \mathrm{softmax}(W h_s + b),$$

where $W$ and $b$ parameterize the selector's output layer. The sampler is trained to reflect statistical associations between content cues (such as entity, event, or attribute presence) and plausible question types (e.g., "what color" for color descriptions, "who" for agentive mentions).
Crucially, by sampling multiple types per document, the model avoids model- or dataset-induced bias toward any single question category, yielding balanced and diverse question distributions. This mechanism significantly improves recall (coverage of references) while maintaining or only slightly reducing precision, a result empirically verified by coverage metrics (e.g., BLEU, METEOR) substantially exceeding the strongest baselines as more questions are sampled per input.
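The contrast between argmax decoding and sampling of question types can be seen with a toy distribution; the probabilities below are invented purely for illustration.

```python
# Sketch: sampling question types avoids collapsing onto the
# majority class, unlike argmax decoding.
import numpy as np

rng = np.random.default_rng(0)
types = ["who", "what", "where", "when", "why", "how"]
# A learned P(t | s) would come from softmax(W @ h_s + b); here we
# hard-code a skewed distribution to illustrate the effect.
p = np.array([0.10, 0.55, 0.12, 0.08, 0.05, 0.10])

argmax_only = [types[int(np.argmax(p))] for _ in range(5)]
sampled = [types[i] for i in rng.choice(len(types), size=5, p=p)]
print("argmax decoding:", argmax_only)  # always "what"
print("sampled types:  ", sampled)      # mixes in rarer types
```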
4. Empirical Evaluation and Performance
Evaluation of document-grounded question generation is performed via standard reference-based metrics (BLEU-n, METEOR, ROUGE-L), analyzed for both precision-style (exact matches) and recall-style (coverage of reference questions). Notable findings include:
- Substantial improvements in BLEU-4 (up to 97% over baseline on VQA) and METEOR (65% increase) when comparing grounded models to prior generative baselines.
- Dramatic gains in recall/coverage (over 200% increase) when generating multiple questions per input, especially for datasets with longer or richer contexts.
- Automatic balancing of question type distributions, reducing the over-representation of generic question forms seen in human-annotated datasets (e.g., the share of "what" questions falling from 89% toward a more balanced distribution).
- The integration of a bigram language model into the decoder eliminates degenerate repetitions and measurably improves fluency, as reflected in higher n-gram overlap and subjective grammaticality.
Qualitative analyses consistently show that grounded generation yields not only more diverse but more informative and content-rich questions, including for less frequent or more challenging question types.
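The precision- vs. recall-style scoring described above can be sketched with NLTK's sentence-level BLEU; the smoothing choice and toy data below are assumptions for illustration, not the paper's evaluation setup.

```python
# Sketch: precision-style BLEU scores each generated question against
# the references; recall-style (coverage) scores each reference
# against the generated set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def bleu(hypothesis: str, references: list[str]) -> float:
    """Sentence-level BLEU of one hypothesis against reference strings."""
    return sentence_bleu([r.split() for r in references],
                         hypothesis.split(), smoothing_function=smooth)

generated = ["who signed the treaty", "what ended the war"]
references = ["who signed the treaty of westphalia",
              "when was the treaty signed",
              "what war did the treaty end"]

precision = sum(bleu(g, references) for g in generated) / len(generated)
coverage = sum(max(bleu(r, [g]) for g in generated)
               for r in references) / len(references)
print(f"precision-style BLEU: {precision:.3f}")
print(f"coverage-style BLEU:  {coverage:.3f}")
```

Generating more questions per input can only raise coverage (each reference gets more chances to be matched), which is why recall-style metrics improve sharply with multiple samples while precision stays roughly flat.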
5. Adaptation to Text-Only, Document-Grounded Settings
The structured principles of visual question generation are directly applicable to text-based document-grounded question generation, with several adaptations:
- Region segmentation becomes passage segmentation: Documents are divided into sentences or contextually coherent spans.
- Visual features are substituted with contextual text embeddings: Transformer encoders (BERT, RoBERTa) provide feature vectors for passages.
- Latent alignment methods (e.g., kernel density estimation): These can be used to probabilistically associate generated questions with possible answer spans or salient content when gold alignments are not annotated.
- Probabilistic diversity and coverage: Instead of defaulting to majority question types, the system samples multiple highly probable types per segment, leading to richer and more challenging question sets.
For text, the challenge shifts toward robust passage scoring, anchoring questions to specific entities or events within a longer discourse, and ensuring that generated questions are linked to answerable spans or salient claims.
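A toy sketch of the KDE-based alignment mentioned above, using scikit-learn's `KernelDensity`: fit a density per candidate answer span over its (toy, 2-D) embeddings, then align the question to the span under which its embedding is most probable. The span names, embeddings, and bandwidth are fabricated for illustration; real systems would use contextual encoders such as BERT.

```python
# Sketch: probabilistic question-to-span alignment via kernel
# density estimation in a shared embedding space.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# One cluster of points per candidate span (e.g., its token embeddings).
span_embeddings = {
    "the Treaty of Westphalia": rng.normal([0.0, 0.0], 0.2, size=(8, 2)),
    "1648": rng.normal([3.0, 3.0], 0.2, size=(8, 2)),
}
question_embedding = np.array([[2.8, 3.1]])  # e.g., "when was it signed?"

log_density = {
    span: KernelDensity(bandwidth=0.5).fit(pts).score_samples(question_embedding)[0]
    for span, pts in span_embeddings.items()
}
best = max(log_density, key=log_density.get)
print("aligned answer span:", best)  # picks "1648"
```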
6. Applications and Broader Implications
Document-grounded question generation models have direct impact on several applications:
- Data augmentation for QA and reading comprehension systems: Automatically generated diverse questions enable creation of larger, more varied training sets, improving system robustness and reducing annotation overhead.
- Conversational and educational agents: By generating dynamic, content-anchored questions, such systems can facilitate more engaging dialog and adaptive assessment scenarios.
- Interactive multi-turn QA and dialog: Document-grounded QG frameworks provide foundational primitives for agents that maintain dialog state and coherence when questioning about documents.
- General-purpose document exploration tools: The approach supports extraction of explicit as well as inferential questions, offering end-users methods for self-paced or critical reading.
7. Summary Table: Core Components and Textual Adaptation
| Component | Visual QG (Zhang et al.) | Document-Grounded QG (Text) |
|---|---|---|
| Segmentation | DenseCap image regions | Document passage segmentation |
| Feature Embedding | VGG-16 + LSTM | Transformer/BERT embeddings |
| Caption Modeling | DenseCap region captions | Extracted passage/sentence text |
| Question Type Selector | LSTM + softmax | LSTM/Transformer + softmax |
| Question Decoder | LSTM + bigram LM | LSTM/Transformer + n-gram LM |
| Diversity Mechanism | Probabilistic type sampling | Passage/type mixing |
| Alignment | Kernel density estimation | Kernel density on text similarity |
Conclusion
The document-grounded question generation paradigm, exemplified by Zhang et al.'s probabilistic, modular framework, establishes a robust, extensible methodology for grounding natural language questions in structured as well as unstructured documents. The approach's explicit modeling of content segmentation, question-type conditioning, and joint decoding is directly transferable to the text domain, facilitating both high coverage and diversity. These models offer substantial advantages for data augmentation, interactive educational tools, and dialogue agents, defining a clear path for the scalable, automated generation of high-quality, document-centric question sets with minimal manual supervision.