Proactive Question Generation

Updated 9 December 2025
  • Proactive Question Generation is a method that actively expands dialogue by adding targeted follow-up questions and additional information to enrich user interactions.
  • It employs multi-step reasoning techniques such as chain-of-thought prompting, planning-based strategies, and reinforcement learning to drive multi-turn engagement.
  • Evaluation relies on semantic similarity, user simulation, and classification metrics to measure effectiveness in uncovering latent user intent and clarifying ambiguous queries.

Proactive Question Generation (PQG) denotes the class of methodologies, models, and evaluation frameworks for constructing dialogue agents and information-seeking systems that do not merely react to user queries but actively engage users by introducing new related information, asking targeted follow-up questions, and strategically steering conversations toward more comprehensive or goal-oriented outcomes. Unlike reactive paradigms, which terminate dialogue upon answering the stated query, PQG systems extend interaction through deliberate information expansion and user engagement, supporting richer, multi-turn exploration, clarification, and personalized discovery.

1. Formal Definitions and Paradigms of Proactivity

The definitive recent formulation of proactivity in Information-Seeking Dialogue (ISD) is given in (Lee et al., 20 Oct 2024). Here, a proactive response $R$ to a user query $Q$ comprises:

  • An Answer: a direct reply addressing $Q$.
  • A Proactive Element: new information related to $Q$, either as a Follow-up Question (FQ) or Additional Information (AI). An FQ asks if the user wants a specific related fact; an AI offers that fact directly.

Formally:

  • $R$ is labeled proactive if both elements are present.
  • FQ: “Would you like to learn about her other roles in the MCU?”
  • AI: “Did you know she also appears in Guardians of the Galaxy Vol. 2?”

This architecture evaluates each response for conversational engagement through the explicit introduction of related knowledge, setting PQG apart from reactive systems that end the session after minimal answer delivery.
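
To make the definition concrete, a proactive response can be modeled as a direct answer plus an optional proactive element. The minimal Python sketch below (class names and the encoding of the labeling rule are illustrative, not taken from the cited papers) checks the "both elements present" condition:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ProactiveType(Enum):
    FOLLOW_UP_QUESTION = "FQ"   # asks whether the user wants a related fact
    ADDITIONAL_INFO = "AI"      # offers the related fact directly


@dataclass
class ProactiveElement:
    kind: ProactiveType
    text: str   # e.g. "Would you like to learn about her other roles in the MCU?"


@dataclass
class Response:
    answer: str                                # direct reply addressing the query Q
    proactive: Optional[ProactiveElement] = None

    def is_proactive(self) -> bool:
        # R is labeled proactive only if both the answer and a
        # proactive element (FQ or AI) are present.
        return bool(self.answer) and self.proactive is not None
```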

In conclusion-driven conversational question generation (CCQG), as instantiated by PCQPR (Guo et al., 2 Oct 2024), proactiveness is defined by the agent's planning over multiple conversational turns to drive the dialogue toward a predefined outcome $T = (q_n, an_n)$, optimizing not just local coherence but global trajectory.

2. Model Architectures and Algorithmic Strategies

Chain-of-Thought and Task Decomposition

PQG is frequently realized through in-context learning with Chain-of-Thought (CoT) prompting (e.g., 3-step CoT, 3-in-1 CoT (Lee et al., 20 Oct 2024)):

  • $P_1$: conversational answer generation.
  • $P_2$: extraction of a specific related fact not present in $P_1$.
  • $P_3$: proactive element generation as either FQ or AI.

Instruction-tuning methods (e.g., QLoRA over Falcon-40B-Instruct) encode PQG objectives via composite prompt templates that enforce multi-step reasoning, yielding substantial improvements (up to +90% gains) over direct zero-shot approaches.
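
As an illustration of the three-step decomposition, the sketch below composes one prompt per step. The template wording and the `call_llm` helper are hypothetical stand-ins for whatever instruction-tuned model (e.g., Falcon-40B-Instruct) is used; they are not the prompts from the cited work.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an instruction-tuned LLM (e.g., Falcon-40B-Instruct)."""
    raise NotImplementedError


def three_step_cot(query: str, element: str = "FQ") -> dict:
    # P1: generate a conversational answer to the user query.
    answer = call_llm(f"Answer the user's question conversationally.\nQuestion: {query}")

    # P2: extract a specific related fact that is NOT already contained in the answer.
    fact = call_llm(
        "State one specific fact related to the question below that does not "
        f"appear in the given answer.\nQuestion: {query}\nAnswer: {answer}"
    )

    # P3: turn the fact into the proactive element, either a follow-up question (FQ)
    # or additional information (AI).
    if element == "FQ":
        proactive = call_llm(f"Rewrite this fact as a yes/no follow-up question: {fact}")
    else:
        proactive = call_llm(f"Rewrite this fact as an offer of additional information: {fact}")

    return {"answer": answer, "proactive": proactive}
```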

Planning-Based and RL Frameworks

PCQPR employs an MCTS-like planner combined with LLM rollouts and “comparable reflection” (Guo et al., 2 Oct 2024):

  • State: $\langle C, H, S_{\text{partial}} \rangle$; Action: next QA pair.
  • Trees simulate future turns, backpropagate rewards, and iterate question plans conditioned on feedback, with long-term optimization toward terminal success.
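
A highly simplified sketch of this planning loop follows. It assumes placeholder `propose` (LLM-proposed candidate QA pairs) and `rollout_reward` (scoring of a simulated continuation toward the target conclusion) functions, and illustrates only the select/expand/simulate/backpropagate cycle, not the actual PCQPR implementation with comparable reflection.

```python
import math
import random
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

QAPair = Tuple[str, str]  # one conversational turn: (question, answer)


@dataclass
class Node:
    state: Tuple[str, List[QAPair], str]  # <context C, history H, partial state S_partial>
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0


def ucb(node: Node, c: float = 1.4) -> float:
    # Upper-confidence bound used during selection; unvisited nodes are explored first.
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(math.log(node.parent.visits) / node.visits)


def plan_next_qa(root: Node, propose, rollout_reward, n_sims: int = 32) -> Node:
    """Run n_sims MCTS-like iterations and return the most-visited child of the root,
    i.e. the next QA pair to ask."""
    for _ in range(n_sims):
        # Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: attach LLM-proposed candidate actions (next QA pairs).
        context, history, partial = node.state
        for qa in propose(node.state):
            node.children.append(Node(state=(context, history + [qa], partial), parent=node))
        leaf = random.choice(node.children) if node.children else node
        # Simulation: roll out future turns and score the terminal state.
        reward = rollout_reward(leaf.state)
        # Backpropagation: update visit counts and values along the path to the root.
        while leaf is not None:
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda n: n.visits)
```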

Reinforcement learning, especially as formulated in proactive information gathering (Huang et al., 28 Jul 2025), rewards clarifying questions that reveal latent user constraints or requirements, optimizing LLMs (e.g., Qwen-2.5-7B) with explicit evidence-sentence rewards:

  • Reward $r_t(q_t) = 1$ if the question elicits hidden user intent; else $0$.
  • PPO is used for policy optimization, aligning question generation with user-specific latent knowledge.
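
The binary reward above can be sketched as a simple check against the user's hidden evidence sentences; the `evidence_reward` function and its string-matching default below are illustrative assumptions, not the reward model used in the cited work.

```python
from typing import Callable, List


def evidence_reward(
    user_reply: str,
    hidden_evidence: List[str],
    matches: Callable[[str, str], bool] = lambda reply, ev: ev.lower() in reply.lower(),
) -> float:
    """Binary reward r_t(q_t): 1.0 if the clarifying question elicited a reply that
    reveals one of the user's hidden evidence sentences (latent constraints or
    requirements), else 0.0. The `matches` predicate is a stand-in for however
    evidence coverage is actually checked."""
    return 1.0 if any(matches(user_reply, ev) for ev in hidden_evidence) else 0.0
```

In training, this scalar would be attached to the sampled question's trajectory and used by PPO to update the policy LLM.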

Graph-Guided and Knowledge-Based Methods

Graph-structured conditioning (AMR graphs, action-flow graphs (Pham et al., 24 Jan 2024)) guarantees exhaustive semantic coverage for PQG in procedural and task-specific contexts:

  • Each graph node (concept/action/ingredient) is mapped to one or more QA pairs.
  • Resulting datasets demonstrate fine-grained coverage and support the training of compact QA models exceeding large LMs on BLEURT and coverage metrics.
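
To make the node-to-QA mapping concrete, the sketch below instantiates question templates per node type. The `GraphNode` schema, the templates, and the `answer_lookup` helper are invented for illustration and do not reproduce the cited pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GraphNode:
    node_id: str
    kind: str             # "action", "ingredient", or "concept"
    label: str            # e.g. "chop", "onion"
    args: Dict[str, str]  # e.g. {"object": "onion", "manner": "finely"}


# Illustrative templates: each node kind maps to one or more question forms.
TEMPLATES = {
    "action": ["How do you {label} the {object}?", "Which {object} do you {label} in this step?"],
    "ingredient": ["How much {label} is needed?"],
    "concept": ["What is {label} used for here?"],
}


def node_to_qa(node: GraphNode,
               answer_lookup: Callable[[GraphNode, str], str]) -> List[Tuple[str, str]]:
    """Map one graph node to one or more QA pairs so that every node of the
    AMR/flow graph is covered by at least one question."""
    slots = {"label": node.label, **node.args}
    qa_pairs = []
    for template in TEMPLATES.get(node.kind, []):
        try:
            question = template.format_map(slots)
        except KeyError:
            continue  # skip templates whose slots this node does not provide
        qa_pairs.append((question, answer_lookup(node, question)))
    return qa_pairs
```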

Knowledge-based frameworks (e.g., KBQG for conversational recommendation (Ren et al., 2021)) exploit knowledge graph (KG) user–item–relation embeddings, mining the most informative relations through attention mechanisms, and prompting users with personalized, slot-filling clarification questions.
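
A minimal sketch of this relation-mining step, assuming precomputed user and relation embeddings; the dot-product attention scoring and the question template are illustrative stand-ins for the KBQG model's actual components.

```python
import numpy as np


def most_informative_relation(user_emb: np.ndarray, relation_embs: dict) -> str:
    """Attend over candidate KG relations with the user embedding and return the
    relation with the highest attention weight, i.e. the one to clarify next."""
    names = list(relation_embs)
    scores = np.array([float(user_emb @ relation_embs[name]) for name in names])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax attention weights
    return names[int(np.argmax(weights))]


# Usage (hypothetical embeddings): ask a slot-filling clarification question.
# relation = most_informative_relation(user_vec, {"genre": genre_vec, "director": director_vec})
# question = f"Which {relation} do you prefer?"
```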

3. Automatic Evaluation Metrics for Proactiveness

PQG is evaluated via diverse metrics reflecting both the direct success of proactive engagement and its correlation with human judgment:

  • Semantic Similarity-based Metrics (Lee et al., 20 Oct 2024): combine BERTScore between $Q$ and the proactive element of $R$, a token-internal BERTScore $\bar{BS}(R)$, and a weighting $\alpha$ (a small combination sketch follows this list):
    • $S_{FQ} = \alpha\, BS(Q,R) + (1-\alpha)\,\bar{BS}(R)$
    • $S_{AI} = \alpha\, BS(Q,R) + (1-\alpha)\,(1-\bar{BS}(R))$
    • High point-biserial correlations with human annotation ($0.46$–$0.58$).
  • User Simulation-based Metrics (Lee et al., 20 Oct 2024): simulate user replies to $R$ via an LLM, compute sentiment via RoBERTa, and aggregate positive scores.
  • Classification-based Methods (Lee et al., 20 Oct 2024): fine-tuned DeBERTa-V3-Large classifiers label valid/invalid FQ/AI elements, yielding logit scores.
  • Task-Specific Proactivity Metrics (Deng et al., 2022): e.g., Clarification Need Prediction F1, ROUGE-2 for clarification questions, and “Proactivity Score” for joint success on ambiguous input detection and clarification.
  • Planning Outcomes (Guo et al., 2 Oct 2024): Success Rate (terminal QA matches $T$), semantic similarity scores (SimCSE), and coherence metrics (Conv-last1/2).
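
As referenced in the semantic-similarity item above, combining the scores is a weighted sum once $BS(Q,R)$ and $\bar{BS}(R)$ have been computed; the sketch below takes those BERTScore values as given inputs rather than reproducing the scoring pipeline, and the default $\alpha$ is an assumption.

```python
def proactivity_scores(bs_q_r: float, bs_internal_r: float, alpha: float = 0.5) -> dict:
    """Combine BERTScore between the query Q and the proactive element of R (bs_q_r)
    with the token-internal BERTScore of R (bs_internal_r), weighted by alpha:
      S_FQ = alpha * BS(Q,R) + (1 - alpha) * BS_bar(R)
      S_AI = alpha * BS(Q,R) + (1 - alpha) * (1 - BS_bar(R))"""
    s_fq = alpha * bs_q_r + (1 - alpha) * bs_internal_r
    s_ai = alpha * bs_q_r + (1 - alpha) * (1 - bs_internal_r)
    return {"S_FQ": s_fq, "S_AI": s_ai}
```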

Tables below summarize metric categories and key findings:

Metric                     Domain           Maximum Correlation
Semantic similarity (BS)   ISD, KG dialog   FQ: 0.46, AI: 0.58
User simulation            ISD              FQ: 0.26, AI: 0.33
Classification-based       ISD              FQ: 0.19, AI: 0.49

Metric             PCQPR (CoQA)   SG-CQG   CoT     ToT     GPT-4-Turbo
Success Rate (%)   35.00          15.40    19.20   23.20   12.80

4. Datasets and Domain Coverage

PQG research comprises multiple specialized corpus constructions:

  • Proactive Dialogue Dataset (Lee et al., 20 Oct 2024): 2,000 single-turn ISD conversations via NQQA, balanced between FQ and AI, each annotated through crowdsourcing.
  • PACIFIC (Deng et al., 2022): hybrid tabular/text domain in finance, with explicit ambiguity induction and annotation for need-clarify detection and clarification question generation.
  • AmbigNQ/PAQA (Erbacher et al., 26 Feb 2024): large-scale open-retrieval QA with ambiguous questions mapped to gold clarifying questions, supporting proactive handling of ambiguous search.
  • KGConv (Faille et al., 11 Apr 2024): knowledge-driven dialogs with fact selection steps for explainable PQG, enabling fact–question mappings and reference-less evaluation.
  • Procedural QA Graphs (Pham et al., 24 Jan 2024): exhaustive QA dataset for procedural text, via AMR and flow graph templating.

5. Empirical Results and Comparative Analysis

PQG architectures yield marked improvements over baselines on multiple fronts:

  • ISD (Falcon-40B-Instruct, (Lee et al., 20 Oct 2024)): 3-step CoT and 3-in-1 CoT zero-shot prompt designs increase FQ classification accuracy from 0.73 to 0.88.
    • Few-shot CoT prompts produce gains of up to +90% over direct zero-shot prompting.
    • Supervised fine-tuning matches or exceeds 3-shot prompting (FQ classification: 0.94).
  • CCQG (GPT-4-Turbo, (Guo et al., 2 Oct 2024)): PCQPR increases Success Rate to 35%, +11.8 pp over strong planning baselines.
  • Procedural QA (Pham et al., 24 Jan 2024): BLEURT scores on graph-generated data match or exceed GPT3/ChatGPT, demonstrating the importance of semantic coverage.
  • Finance QA (Deng et al., 2022): UniPCQA achieves >87% ROUGE-2 on clarifier generation, >91% F1 in ambiguity detection.
  • Open-Retrieval QA (Erbacher et al., 26 Feb 2024): Adding gold evidence passages boosts ambiguity detection accuracy from 0.527 (Q-only) to 0.873.
  • Mental Health Diagnostics (Roy et al., 2023): ProKnow-algo reduces unsafe matches by 89%, improves explainability scores by +0.5 vs. baseline, and achieves an 82% composite gain.

6. Practical Extensions and Future Directions

PQG is extensible across domains and architectures:

  • Finer-grained proactive elements: e.g., comparative or conditional follow-ups (Lee et al., 20 Oct 2024).
  • Reward modeling and RLHF: direct optimization of conversational proactivity.
  • Multi-turn annotation corpora: moving beyond single-turn protocols.
  • Factuality checks: mitigating hallucination risk while maintaining engagement.
  • Conversation-level metrics: capturing cumulative proactivity across sessions.

Algorithmic and data-centric future work is focused on scaling to longer conversations, integrating human reflection feedback, domain adaptation (medical/finance/legal), dynamic knowledge-graph augmentation, and learning-to-rank candidate follow-ups.

7. Significance and Open Challenges

PQG fundamentally transforms the capabilities of dialogue agents: from passive answerers to strategic collaborators capable of uncovering user intent, reducing ambiguity, improving knowledge coverage, and supporting discovery-focused human–AI interaction. Critical challenges persist in cost-effective depth planning, automatic ambiguity detection in open domains, robust metric development for multi-turn proactivity, and the scalable curation of annotated datasets for domain-specific needs.

Proactive methodologies, from CoT decomposition and reward-driven RL to graph-driven semantic coverage and knowledge-based slot-filling, collectively define the technical landscape of PQG, guiding ongoing research and practical deployment across information-seeking, recommendation, search, and creative domains.
