Doc2Query++: Topic-Guided Document Expansion
- The paper introduces Dual-Index Fusion, decomposing text and query signals to mitigate concatenation-induced noise in dense retrieval.
- Doc2Query++ employs unsupervised topic modeling with hybrid keyword selection to guide controlled and diverse query generation.
- The framework achieves superior MAP, nDCG@10, and Recall@100 by explicitly ensuring comprehensive topic coverage and reducing query redundancy.
Doc2Query++ is a topic-coverage guided document expansion framework designed to address long-standing limitations of neural query generation in information retrieval. By explicitly structuring the query generation process to ensure coverage of a document’s latent topics and using hybrid keyword selection, Doc2Query++ outperforms prior methods in both sparse and dense retrieval across a diverse set of domains. Its signature contribution, Dual-Index Fusion, separates text and query signals in dense retrieval to mitigate concatenation-induced noise and enable robust performance improvements.
1. Motivation and Background
Document expansion (DE) via query prediction aims to ameliorate vocabulary mismatch between user queries and underlying document representations. Standard approaches such as Doc2Query (Nogueira et al., 2019) train sequence-to-sequence models to generate likely queries for each document, which are then appended to the document text before indexing. This method improves recall in sparse retrieval frameworks (e.g., BM25), but faces three core challenges:
- Uncontrolled Generation: Sequence models may produce hallucinated, redundant, or semantically overlapping queries, resulting in incomplete topic coverage and disproportionate expansion of common word patterns.
- Domain Generalization: Models trained on a single domain (e.g., MS MARCO) generalize poorly to out-of-domain datasets (e.g., BEIR), largely due to missing, mis-represented, or redundant topic facets.
- Dense Retrieval Noise: Concatenation of synthetic queries to document text introduces distributional noise that degrades dense retrieval, where semantic embedding proximity is paramount.
While prompting LLMs increases cross-domain applicability, unconstrained prompts fail to ensure comprehensive and non-redundant topic coverage. Taxonomy-controlled methods, although effective, are domain-specific and limit broad adoption.
Doc2Query++ was introduced (Kuo et al., 10 Oct 2025) to explicitly structure document expansion by guiding query generation through inferred topics and hybrid keyword selection, yielding improvements on coverage, diversity, and downstream retrieval effectiveness.
2. Latent Topic Inference and Keyword Selection
Doc2Query++ first identifies latent topics within a document using unsupervised topic modeling, ensuring cross-domain coverage and adaptability:
- Sentence Segmentation and Encoding: The document is split into sentences and each sentence embedding is computed via a domain-specific SBERT encoder.
- Topic Clustering: BERTopic applies HDBSCAN clustering to the sentence embeddings, producing topic clusters with centroids $\mu_1, \dots, \mu_K$; each sentence is then assigned to the topic whose centroid is nearest in embedding space.
- Class-based TF-IDF (c-TF-IDF) labeling: Each topic cluster is labeled by LLM-guided refinement of c-TF-IDF salient terms.
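The topic-assignment step above can be sketched as follows. This is a minimal illustration assuming sentences have already been embedded; nearest-centroid assignment stands in for the full BERTopic/HDBSCAN clustering, and the toy 2-D vectors are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def assign_topics(sentence_embeddings, centroids):
    """Assign each sentence to the topic whose centroid is nearest
    in embedding space (illustrative stand-in for HDBSCAN clusters)."""
    return [
        max(range(len(centroids)), key=lambda k: cosine(e, centroids[k]))
        for e in sentence_embeddings
    ]

# Toy 2-D "embeddings": two sentences near topic 0, one near topic 1
sents = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0]]
cents = [[1.0, 0.0], [0.0, 1.0]]
print(assign_topics(sents, cents))  # -> [0, 0, 1]
```

In the actual pipeline the embeddings come from a domain-specific SBERT encoder and the clusters (and their centroids) are produced by HDBSCAN rather than fixed in advance.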
Doc2Query++ then constructs a hybrid keyword pool via:
- Topic-level keywords: Top c-TF-IDF terms from each topic cluster.
- Document-level phrases: KeyBERT ranks document n-grams by cosine similarity to the document embedding, with Maximal Marginal Relevance (MMR) applied to prioritize diversity.
The union of the topic-level and document-level keyword sets is filtered by an LLM to select a subset of keywords for robust topic coverage. This hybrid pool counters query redundancy and promotes representation of infrequent but significant topics.
3. Structured Query Generation and Diversity Control
Doc2Query++ leverages a few-shot prompting scheme where the LLM receives the latent topic labels and the selected keyword set per document. Carefully curated exemplars and instructions are provided to elicit generation of queries that:
- Cover each inferred topic at least once
- Utilize the full breadth of selected keywords, avoiding duplication
- Maintain sufficient lexical and semantic diversity
Typically, the LLM is prompted to generate three queries per pass, each conditioned on different topic-keyword pairs. The result is a comprehensive, non-redundant set of synthetic queries for document expansion, which are then indexed alongside the original text.
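The prompt-assembly step can be sketched as below. The template wording, function name, and constraint phrasing are illustrative assumptions; the paper's actual few-shot prompt (with curated exemplars) is not reproduced here:

```python
def build_expansion_prompt(topic_labels, keywords, n_queries=3):
    """Assemble an instruction asking an LLM for queries that cover
    every inferred topic and avoid reusing keywords across queries.
    (Illustrative template, not the paper's exact prompt.)"""
    lines = [
        f"Generate {n_queries} distinct search queries for this document.",
        "Constraints:",
        "- Cover each topic below at least once.",
        "- Use the provided keywords; do not repeat a keyword across queries.",
        "Topics: " + "; ".join(topic_labels),
        "Keywords: " + ", ".join(keywords),
    ]
    return "\n".join(lines)

prompt = build_expansion_prompt(
    ["dense retrieval", "query generation"],
    ["vocabulary mismatch", "BM25", "topic coverage"],
)
print(prompt)
```

Conditioning each pass on different topic-keyword pairs, as the text describes, would amount to calling this builder with different slices of the topic and keyword sets.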
4. Dual-Index Fusion for Dense Retrieval
To mitigate concatenation-induced noise in dense embedding-based retrieval, Doc2Query++ introduces Dual-Index Fusion:
- Separate Embedding Spaces:
  - $I_{\text{text}}$: indexes the original document embeddings
  - $I_{\text{query}}$: indexes the embeddings of the synthetic queries generated for each document
- Retrieval Operation, for a query embedding $e_q$:
  - Text index score: $S_{\text{text}}(q, d) = \text{sim}(e_q, e_d)$
  - Query index score: $S_{\text{query}}(q, d) = \max_i \text{sim}(e_q, e_{q_i^d})$, where $i$ indexes the synthetic queries for document $d$
- Fusion Formula: $S(q, d) = \lambda\, S_{\text{text}}(q, d) + (1 - \lambda)\, S_{\text{query}}(q, d)$, where $\lambda \in [0, 1]$ balances the influence of text and query signals.
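The fused scoring step can be sketched as follows. This is a minimal illustration assuming cosine similarity and max-pooling over a document's synthetic-query embeddings; the paper's exact similarity function and aggregation may differ, and the toy vectors are hypothetical:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fused_score(q_emb, doc_emb, query_embs, lam=0.5):
    """Dual-Index Fusion: combine similarity to the document text
    (text index) with the best similarity among the document's
    synthetic queries (query index), weighted by lam."""
    s_text = cosine(q_emb, doc_emb)
    s_query = max(cosine(q_emb, e) for e in query_embs)
    return lam * s_text + (1 - lam) * s_query

# Toy case: the user query is orthogonal to the document text but
# matches one synthetic query exactly, so the query index rescues it.
score = fused_score(
    q_emb=[1.0, 0.0],
    doc_emb=[0.0, 1.0],
    query_embs=[[1.0, 0.0], [0.0, 1.0]],
)
print(score)  # -> 0.5
```

Because the two signals live in separate indexes, a poor text match cannot be dragged further down by concatenated query noise, and `lam` tunes how much each index contributes per retrieval scenario.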
This approach isolates semantic expansion benefits from document text, yielding improved MAP, nDCG@10, and Recall@100 in dense retrieval tasks. Dual-Index Fusion is critical to prevent detrimental effects associated with naive appending of queries in embedding-based retrieval.
5. Experimental Results and Comparative Performance
Doc2Query++ achieves superior performance compared to Doc2Query, Doc2Query–– (Gospodinov et al., 2023), and both zero/few-shot LLM prompting baselines:
- Datasets: Evaluations span BEIR subsets including NFCorpus, SCIDOCS, FiQA-2018, Arguana, and Scifact, representing substantial domain heterogeneity.
- Sparse Retrieval (e.g., BM25): Doc2Query++ increases MAP and nDCG@10 in all benchmarks, outperforming prior neural and LLM-based expansion methods.
- Dense Retrieval: Contriever-based retrieval with Dual-Index Fusion shows further gains over concatenation strategies and single-index approaches.
- Ablation Studies: Both topic modeling and hybrid keyword selection are shown to be necessary for optimal coverage and non-redundancy; omitting either reduces effectiveness.
Table: Key Improvements in Sparse and Dense Retrieval Settings
| System | Sparse MAP* | Dense MAP* | nDCG@10* |
|---|---|---|---|
| Doc2Query | lower | lower | lower |
| Doc2Query–– | mid | mid | mid |
| Doc2Query++ | highest | highest | highest |
*Relative scores; see (Kuo et al., 10 Oct 2025) Table 1 for full metrics.
6. Applications, Limitations, and Future Directions
Doc2Query++ is broadly applicable to:
- Open-domain search and QA pipelines
- Scientific, biomedical, and financial document retrieval
- Any IR scenario encountering severe vocabulary mismatch
Its cross-domain robustness stems from unsupervised topic modeling, which adapts to dataset-specific thematic structure without domain-anchored taxonomies. The framework is compatible with both sparse and dense retrieval infrastructures.
Limitations include:
- Computational overhead from multi-stage topic modeling and query generation
- Dependence on the quality of topic modeling, LLM prompting, and fusion parameter tuning
Future work may address adaptive expansion control mechanisms, retrieval-informed feedback for iterative topic refinement, and extensions to multilingual or cross-modal retrieval.
7. Figures and Formalization
The core Doc2Query++ pipeline can be represented visually as:
- Topic segmentation (via BERTopic/HDBSCAN and SBERT embeddings)
- Hybrid keyword extraction (c-TF-IDF + KeyBERT/MMR)
- LLM-structured query generation
- Retrieval with Dual-Index Fusion: $S(q, d) = \lambda\, S_{\text{text}}(q, d) + (1 - \lambda)\, S_{\text{query}}(q, d)$
- Flexible weighting ($\lambda$) for different retrieval scenarios
Conclusion
Doc2Query++ establishes a topic-coverage based paradigm for document expansion, leveraging unsupervised latent topic inference and hybrid keyword extraction to direct controlled, diverse query generation. Through Dual-Index Fusion, it resolves the tension between enrichment and noise in dense retrieval. Experiments demonstrate its superior retrieval effectiveness across domains, positioning Doc2Query++ as a robust framework for addressing vocabulary mismatch in both classical and neural IR systems (Kuo et al., 10 Oct 2025).