Doc2Query++: Topic-Guided Document Expansion

Updated 13 October 2025
  • The paper introduces Dual-Index Fusion, separating text and query signals to mitigate concatenation-induced noise in dense retrieval.
  • Doc2Query++ employs unsupervised topic modeling with hybrid keyword selection to guide controlled and diverse query generation.
  • The framework achieves superior MAP, nDCG@10, and Recall@100 by explicitly ensuring comprehensive topic coverage and reducing query redundancy.

Doc2Query++ is a topic-coverage guided document expansion framework designed to address long-standing limitations of neural query generation in information retrieval. By explicitly structuring the query generation process to ensure coverage of a document’s latent topics and using hybrid keyword selection, Doc2Query++ outperforms prior methods in both sparse and dense retrieval across a diverse set of domains. Its signature contribution, Dual-Index Fusion, separates text and query signals in dense retrieval to mitigate concatenation-induced noise and enable robust performance improvements.

1. Motivation and Background

Document expansion (DE) via query prediction aims to mitigate the vocabulary mismatch between user queries and underlying document representations. Standard approaches such as Doc2Query (Nogueira et al., 2019) train sequence-to-sequence models to generate likely queries for each document, which are then appended to the document text before indexing (a minimal sketch of this step follows the list below). This method improves recall in sparse retrieval frameworks (e.g., BM25), but faces three core challenges:

  • Uncontrolled Generation: Sequence models may produce hallucinated, redundant, or semantically overlapping queries, resulting in incomplete topic coverage and disproportionate expansion of common word patterns.
  • Domain Generalization: Models trained on a single domain (e.g., MS MARCO) generalize poorly to out-of-domain datasets (e.g., BEIR), largely due to missing, misrepresented, or redundant topic facets.
  • Dense Retrieval Noise: Concatenation of synthetic queries to document text introduces distributional noise that degrades dense retrieval, where semantic embedding proximity is paramount.
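To make the baseline concrete, here is a minimal sketch of the append-then-index step, using the rank_bm25 package for BM25 scoring; the example document and generated queries are invented for illustration:

```python
# Sketch of Doc2Query-style expansion: append predicted queries to each
# document, then index the expanded text for sparse retrieval.
# Requires: pip install rank-bm25
from rank_bm25 import BM25Okapi

docs = ["The cheetah is the fastest land animal, reaching about 100 km/h."]
generated_queries = [["how fast can a cheetah run", "fastest land animal"]]

# Expansion: concatenate each document with its predicted queries.
expanded = [d + " " + " ".join(qs) for d, qs in zip(docs, generated_queries)]

# Index the expanded corpus (whitespace tokenization for illustration only).
bm25 = BM25Okapi([doc.lower().split() for doc in expanded])

# The appended queries let BM25 match vocabulary the original text lacked.
print(bm25.get_scores("how fast is a cheetah".split()))
```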

While prompting LLMs increases cross-domain applicability, unconstrained prompts fail to ensure comprehensive and non-redundant topic coverage. Taxonomy-controlled methods, although effective, are domain-specific and limit broad adoption.

Doc2Query++ was introduced (Kuo et al., 10 Oct 2025) to explicitly structure document expansion by guiding query generation through inferred topics and hybrid keyword selection, yielding improvements on coverage, diversity, and downstream retrieval effectiveness.

2. Latent Topic Inference and Keyword Selection

Doc2Query++ first identifies latent topics within a document using unsupervised topic modeling, ensuring cross-domain coverage and adaptability:

  • Sentence Segmentation and Encoding: The document $d$ is split into sentences $\{s_i\}$, and each sentence embedding $z_{s_i}$ is computed with a domain-specific SBERT encoder.
  • Topic Clustering: BERTopic applies HDBSCAN clustering to the sentence embeddings, producing $C$ clusters (topics) with centroids $\{\mu_j\}_{j=1}^{C}$. Each sentence is assigned to its nearest centroid:

$$\phi(z_s) = \arg\min_{j \in \{1,\ldots,C\}} \|z_s - \mu_j\|_2$$

  • Class-based TF-IDF (c-TF-IDF) labeling: Each topic cluster is labeled via LLM-guided refinement of its most salient c-TF-IDF terms.
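A minimal sketch of this stage using the sentence-transformers and BERTopic packages; the encoder choice, min_topic_size, and toy sentences are illustrative assumptions, not the paper's configuration:

```python
# Sketch: cluster a document's sentences into latent topics with BERTopic
# (SBERT embeddings + HDBSCAN), then read off c-TF-IDF terms per cluster.
# Requires: pip install sentence-transformers bertopic
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

sentences = [  # toy stand-in for a real document's sentence list
    "Aspirin inhibits platelet aggregation.",
    "Low-dose aspirin lowers heart attack risk.",
    "Clopidogrel is an alternative antiplatelet agent.",
    "Antiplatelet therapy requires bleeding-risk monitoring.",
    "Aspirin is inexpensive and widely available.",
    "Dual antiplatelet therapy follows stent placement.",
    "BM25 ranks documents by term-frequency statistics.",
    "Dense retrievers embed queries and documents.",
    "Vocabulary mismatch hurts sparse retrieval recall.",
    "Query expansion adds related terms to documents.",
    "Neural rankers rerank candidate documents.",
    "Inverted indexes enable fast sparse lookup.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # generic SBERT encoder
topic_model = BERTopic(embedding_model=encoder, min_topic_size=2)
topics, _ = topic_model.fit_transform(sentences)

# Top c-TF-IDF terms per topic: candidates for LLM-guided label refinement.
for topic_id in sorted(set(topics)):
    if topic_id != -1:  # -1 is HDBSCAN's outlier bucket
        print(topic_id, [term for term, _ in topic_model.get_topic(topic_id)])
```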

Doc2Query++ then constructs a hybrid keyword pool via:

  • Topic-level keywords: Top c-TF-IDF terms from each topic cluster ($K_d^{\text{topic}}$).
  • Document-level phrases: KeyBERT ranks document n-grams by cosine similarity to the document embedding, with Maximal Marginal Relevance (MMR) applied to promote diversity ($K_d^{\text{doc}}$).

The union:

$$K_d = K_d^{\text{topic}} \cup K_d^{\text{doc}}$$

is filtered by an LLM to select a subset of keywords for robust topic coverage. This hybrid pool counters query redundancy and promotes representation of infrequent but significant topics.
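The document-level half of this pool can be sketched with KeyBERT's MMR option; the ngram range, diversity, and top_n values, and the hardcoded topic terms, are illustrative assumptions:

```python
# Sketch: document-level keyphrases via KeyBERT with MMR diversification,
# united with topic-level c-TF-IDF terms to form the hybrid pool K_d.
# Requires: pip install keybert
from keybert import KeyBERT

document = (
    "Low-dose aspirin reduces heart attack risk by inhibiting platelet "
    "aggregation, though it raises the chance of gastrointestinal bleeding."
)

kw_model = KeyBERT()  # defaults to an SBERT backbone

# MMR trades relevance to the document embedding against phrase diversity.
doc_keywords = kw_model.extract_keywords(
    document,
    keyphrase_ngram_range=(1, 3),
    use_mmr=True,
    diversity=0.6,
    top_n=5,
)

topic_keywords = {"aspirin", "platelet aggregation"}  # from the c-TF-IDF stage
pool = topic_keywords | {kw for kw, _ in doc_keywords}  # K_d = union
print(pool)  # candidate pool, subsequently filtered by an LLM
```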

3. Structured Query Generation and Diversity Control

Doc2Query++ leverages a few-shot prompting scheme where the LLM receives the latent topic labels and the selected keyword set per document. Carefully curated exemplars and instructions are provided to elicit generation of queries that:

  • Cover each inferred topic at least once
  • Utilize the full breadth of selected keywords, avoiding duplication
  • Maintain sufficient lexical and semantic diversity

Typically, the LLM is prompted to generate three queries per pass, each conditioned on different topic-keyword pairs. The result is a comprehensive, non-redundant set of synthetic queries for document expansion, which are then indexed alongside the original text.
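The paper's verbatim prompt is not reproduced here; the following sketch shows one plausible way to pack topic labels, keywords, and an exemplar into such an instruction (the template wording and the exemplar are invented for illustration):

```python
# Sketch: assemble a topic- and keyword-conditioned few-shot prompt.
# The template wording is illustrative, not the paper's actual prompt.
def build_prompt(topics: list[str], keywords: list[str]) -> str:
    exemplar = (
        "Topics: aspirin efficacy; bleeding risk\n"
        "Keywords: low-dose aspirin, platelet aggregation, GI bleeding\n"
        "Queries:\n"
        "1. does low dose aspirin prevent heart attacks\n"
        "2. how does aspirin affect platelet aggregation\n"
        "3. aspirin gastrointestinal bleeding risk\n"
    )
    return (
        "Generate 3 search queries for the document. Cover every topic at "
        "least once, use the listed keywords without repeating any across "
        "queries, and keep the queries lexically and semantically diverse.\n\n"
        f"Example:\n{exemplar}\n"
        f"Topics: {'; '.join(topics)}\n"
        f"Keywords: {', '.join(keywords)}\n"
        "Queries:\n"
    )

print(build_prompt(["antiplatelet therapy"], ["aspirin", "clopidogrel"]))
```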

4. Dual-Index Fusion for Dense Retrieval

To mitigate concatenation-induced noise in dense embedding-based retrieval, Doc2Query++ introduces Dual-Index Fusion:

  • Separate Embedding Spaces:
    • $\mathcal{I}_t$: indexes the original document embeddings $\mathbf{v}_d$
    • $\mathcal{I}_q$: indexes embeddings of the synthetic queries generated for each document, $\mathbf{u}_{d,i}$
  • Retrieval Operation, for a query embedding $\mathbf{v}_Q$:
    • Text index score: $S_t(d) = \text{sim}(\mathbf{v}_Q, \mathbf{v}_d)$
    • Query index score: $S_q(d) = \max_{j \in \mathcal{Q}_d} \text{sim}(\mathbf{v}_Q, \mathbf{u}_j)$, where $\mathcal{Q}_d$ indexes the queries generated for document $d$
  • Fusion Formula:

$$S(d) = (1 - \alpha)\, S_t(d) + \alpha\, S_q(d), \qquad \alpha \in [0, 1]$$

Here $\alpha$ balances the influence of the text and query signals.
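A minimal NumPy sketch of this scoring rule, assuming all embeddings are pre-normalized so cosine similarity reduces to a dot product; the α value and random vectors are illustrative:

```python
# Sketch: Dual-Index Fusion scoring. Assumes L2-normalized embeddings,
# so sim(a, b) reduces to a dot product.
import numpy as np

def fused_scores(v_q, doc_embs, query_embs_per_doc, alpha=0.3):
    """Return S(d) = (1 - alpha) * S_t(d) + alpha * S_q(d) for each doc.

    v_q:                (dim,) query embedding
    doc_embs:           (n_docs, dim) text index I_t
    query_embs_per_doc: list of (n_queries_d, dim) arrays, query index I_q
    """
    s_t = doc_embs @ v_q                    # S_t(d) = sim(v_Q, v_d)
    s_q = np.array([float((u @ v_q).max())  # S_q(d) = max_j sim(v_Q, u_j)
                    for u in query_embs_per_doc])
    return (1 - alpha) * s_t + alpha * s_q

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x, axis=-1, keepdims=True)
v_q = unit(rng.normal(size=8))
docs = unit(rng.normal(size=(3, 8)))                         # 3 documents
queries = [unit(rng.normal(size=(4, 8))) for _ in range(3)]  # 4 queries each
print(fused_scores(v_q, docs, queries))  # one fused score per document
```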

This approach isolates semantic expansion benefits from document text, yielding improved MAP, nDCG@10, and Recall@100 in dense retrieval tasks. Dual-Index Fusion is critical to prevent detrimental effects associated with naive appending of queries in embedding-based retrieval.

5. Experimental Results and Comparative Performance

Doc2Query++ achieves superior performance compared to Doc2Query, Doc2Query–– (Gospodinov et al., 2023), and both zero/few-shot LLM prompting baselines:

  • Datasets: Evaluations span BEIR subsets including NFCorpus, SCIDOCS, FiQA-2018, ArguAna, and SciFact, representing substantial domain heterogeneity.
  • Sparse Retrieval (e.g., BM25): Doc2Query++ increases MAP and nDCG@10 in all benchmarks, outperforming prior neural and LLM-based expansion methods.
  • Dense Retrieval: Contriever-based retrieval with Dual-Index Fusion shows further gains over concatenation strategies and single-index approaches.
  • Ablation Studies: Both topic modeling and hybrid keyword selection are shown to be necessary for optimal coverage and non-redundancy; omitting either reduces effectiveness.

Table: Key Improvements in Sparse and Dense Retrieval Settings

System        Sparse MAP*   Dense MAP*   nDCG@10*
Doc2Query     lower         lower        lower
Doc2Query––   mid           mid          mid
Doc2Query++   highest       highest      highest

*Relative scores; see (Kuo et al., 10 Oct 2025) Table 1 for full metrics.

6. Applications, Limitations, and Future Directions

Doc2Query++ is broadly applicable to:

  • Open-domain search and QA pipelines
  • Scientific, biomedical, and financial document retrieval
  • Any IR scenario encountering severe vocabulary mismatch

Its cross-domain robustness stems from unsupervised topic modeling, which adapts to dataset-specific thematic structure without domain-anchored taxonomies. The framework is compatible with both sparse and dense retrieval infrastructures.

Limitations include:

  • Computational overhead from multi-stage topic modeling and query generation
  • Dependence on the quality of topic modeling, LLM prompting, and fusion parameter tuning

Future work may address adaptive expansion control mechanisms, retrieval-informed feedback for iterative topic refinement, and extensions to multilingual or cross-modal retrieval.

7. Figures and Formalization

The core Doc2Query++ pipeline can be represented visually as:

  1. Topic segmentation (via BERTopic/HDBSCAN and SBERT embeddings)
  2. Hybrid keyword extraction (c-TF-IDF + KeyBERT/MMR)
  3. LLM-structured query generation
  4. Retrieval with Dual-Index Fusion:

$$S(d) = (1 - \alpha) S_t(d) + \alpha S_q(d), \quad S_t(d) = \text{sim}(\mathbf{v}_Q, \mathbf{v}_d), \quad S_q(d) = \max_{j \in \mathcal{Q}_d} \text{sim}(\mathbf{v}_Q, \mathbf{u}_j)$$

  5. Flexible weighting ($\alpha$) for different retrieval scenarios

Conclusion

Doc2Query++ establishes a topic-coverage based paradigm for document expansion, leveraging unsupervised latent topic inference and hybrid keyword extraction to direct controlled, diverse query generation. Through Dual-Index Fusion, it resolves the tension between enrichment and noise in dense retrieval. Experiments demonstrate its superior retrieval effectiveness across domains, positioning Doc2Query++ as a robust framework for addressing vocabulary mismatch in both classical and neural IR systems (Kuo et al., 10 Oct 2025).
