InSQuAD: Exemplar Selection Framework
- InSQuAD is a framework that employs submodular mutual information to ensure the selected exemplars are both relevant and diverse for effective in-context learning.
- The approach uses a combinatorial training paradigm with a likelihood-based loss to optimize the balance between quality and diversity.
- Empirical results across nine benchmarks show significant improvements in classification, multi-choice, and generative QA tasks, while also reducing inference time.
InSQuAD is a framework for exemplar selection in In-Context Learning (ICL) that enforces both quality (relevance) and diversity among in-context examples using Submodular Mutual Information (SMI) functions and a combinatorial training paradigm. Developed to address limitations in traditional retrieval methods—where query relevance is modeled at the expense of diversity—InSQuAD achieves robust ICL by modeling exemplar selection as a targeted submodular maximization problem and by training a dedicated retrieval model via a likelihood-based loss over SMI. The approach is validated empirically across nine benchmark datasets, demonstrating substantial gains over relevance-only baselines and reducing inference time through efficient combinatorial selection and dataset augmentation with paraphrases.
1. Motivation and Problem Formulation
The premise of InSQuAD is that effective ICL requires selecting in-context exemplars that are not merely relevant to the test query, but also collectively diverse and non-redundant. Existing retrieval approaches predominantly optimize for quality—gathering examples nearest to the query in embedding space—yet ignore the combinatorial structure that arises when exemplars overlap semantically or syntactically. InSQuAD targets three properties: quality, diversity, and order.
To formalize, InSQuAD frames the selection as:
- Exemplar Annotation: Constructing a diverse subset from an unlabeled pool to represent the annotation distribution.
- Exemplar Retrieval: Given a query $q$, selecting the top-$k$ in-context examples that maximize both query similarity and mutual non-overlap.
This strategy ensures that the chosen set maximizes information with respect to the query while minimizing redundancy among selected exemplars, which is crucial for prompting LLMs in multi-hop or reasoning-intensive QA.
2. Submodular Mutual Information (SMI) Functions
InSQuAD uses SMI functions to balance relevance and diversity in selection:
- Quality: Quantified by the mutual information between the exemplar set $\mathcal{A}$ and the query $q$, $I_f(\mathcal{A}; q)$.
- Diversity: Enforced via submodular functions, which reward incremental “coverage” and penalize redundancy.
Formally, the retrieval step selects
$$\mathcal{E}(q) = \operatorname*{arg\,max}_{\mathcal{A} \subseteq \mathcal{V},\ |\mathcal{A}| \le k} I_f(\mathcal{A};\, q),$$
where $\mathcal{V}$ is the pool of candidate exemplars and $I_f$ is the SMI function. Submodularity ensures that greedy selection yields near-optimal solutions efficiently, capturing both incremental query relevance and pairwise diversity.
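Greedy maximization of such an objective can be sketched as follows. The graph-cut-style quality-minus-redundancy surrogate below is an assumption for illustration; the paper's actual SMI instantiations may differ:

```python
import numpy as np

def greedy_smi_select(sim_to_query, sim_pairwise, k, lam=1.0, nu=1.0):
    """Greedily pick k exemplars maximizing query relevance minus redundancy.

    sim_to_query: (n,) candidate-to-query similarities (quality term).
    sim_pairwise: (n, n) candidate-to-candidate similarities (diversity term).
    The marginal gain of candidate c is lam * s(c, q) minus nu times the sum
    of its similarities to already-selected exemplars -- a simple submodular
    stand-in for the SMI objective.
    """
    selected = []
    for _ in range(k):
        best_idx, best_gain = None, -np.inf
        for c in range(len(sim_to_query)):
            if c in selected:
                continue
            redundancy = sum(sim_pairwise[c, s] for s in selected)
            gain = lam * sim_to_query[c] - nu * redundancy
            if gain > best_gain:
                best_idx, best_gain = c, gain
        selected.append(best_idx)
    return selected
```

With two near-duplicate, highly relevant candidates, the redundancy term steers the second pick toward a distinct exemplar instead of the duplicate.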
During annotation, a diverse subset is chosen as
$$\mathcal{L} = \operatorname*{arg\,max}_{\mathcal{A} \subseteq \mathcal{U},\ |\mathcal{A}| \le B} f(\mathcal{A}),$$
where $B$ is the annotation budget and the submodular function $f$ is scored against the full unlabeled pool $\mathcal{U}$ (for diversity).
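For the annotation step, a facility-location objective scored against the full pool is one standard submodular choice; the exact function used is an assumption here. A minimal greedy sketch:

```python
import numpy as np

def facility_location_greedy(sim, budget):
    """Greedily pick a budget-limited subset that covers the whole pool.

    sim: (n, n) pairwise similarity matrix over the unlabeled pool.
    Maximizes f(A) = sum_i max_{j in A} sim[i, j], which rewards subsets
    that collectively cover every region of the pool (diversity).
    """
    n = sim.shape[0]
    selected = []
    coverage = np.zeros(n)  # best similarity of each pool item to the subset
    for _ in range(budget):
        best_idx, best_gain = None, -np.inf
        for c in range(n):
            if c in selected:
                continue
            gain = np.maximum(coverage, sim[:, c]).sum() - coverage.sum()
            if gain > best_gain:
                best_idx, best_gain = c, gain
        selected.append(best_idx)
        coverage = np.maximum(coverage, sim[:, best_idx])
    return selected
```

On a pool with two tight clusters, the greedy picks land one exemplar in each cluster rather than two in the same one.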
3. Combinatorial Training and Likelihood-Based Loss
To prevent the retrieval model from overfitting to query similarity alone, InSQuAD introduces a combinatorial training protocol (InSQuAD-LEARN) that adapts SMI parameters through a likelihood-based loss derived from Submodular Point Processes (SPPs).
Given a query $q$, a set of relevant documents $\mathcal{R}$, and distractor documents $\mathcal{D}$, the probability of choosing a set $\mathcal{S}$ under an SPP is proportional to its SMI score:
$$P(\mathcal{S} \mid q) = \frac{I_f(\mathcal{S};\, q)}{\sum_{\mathcal{S}' \subseteq \mathcal{V}} I_f(\mathcal{S}';\, q)}.$$
The ratio for relevant over distractor sets is
$$\frac{P(\mathcal{R} \mid q)}{P(\mathcal{D} \mid q)} = \frac{I_f(\mathcal{R};\, q)}{I_f(\mathcal{D};\, q)},$$
since the normalization constant cancels, yielding the negative log-likelihood
$$\mathcal{L}_Q = -\log \frac{I_f(\mathcal{R};\, q)}{I_f(\mathcal{R};\, q) + I_f(\mathcal{D};\, q)}.$$
The overall joint loss, including diversity enforced by paraphrastic augmentations, is
$$\mathcal{L} = \mathcal{L}_Q + \lambda\, \mathcal{L}_D,$$
where $\mathcal{L}_Q$ (quality loss) and $\mathcal{L}_D$ (diversity loss) compare the information overlap between query, relevant, and paraphrased distractor sets, and $\lambda$ weights their relative importance.
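Because the SPP normalization cancels in the relevant-versus-distractor comparison, the quality loss reduces to a two-way contrastive form over nonnegative SMI scores. A minimal numpy sketch, assuming this contrastive normalization (the paper's exact formulation may differ):

```python
import numpy as np

def spp_nll(smi_relevant, smi_distractor):
    """Contrastive negative log-likelihood over nonnegative SMI scores.

    Normalizing P(S | q) proportional to I_f(S; q) over just the relevant
    set R and the distractor set D gives P(R | q) = I_R / (I_R + I_D);
    minimizing the negative log pushes the retriever to score R above D.
    """
    p_relevant = smi_relevant / (smi_relevant + smi_distractor)
    return -np.log(p_relevant)

def joint_loss(loss_quality, loss_diversity, lam=1.0):
    """Joint objective L = L_Q + lam * L_D from the combinatorial training step."""
    return loss_quality + lam * loss_diversity
```

The loss shrinks as the SMI score of the relevant set grows relative to the distractor set, which is exactly the gradient signal the retrieval model trains on.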
4. Dataset Augmentation via Paraphrases
A unique component is paraphrase augmentation. Multi-hop QA datasets, such as HotpotQA, lack sufficient paraphrastic or distractor variants. InSQuAD addresses this by synthetically generating paraphrases for each supporting document using large models (e.g., GPT-3.5 Turbo). Training instances thus comprise the query $q$, $\mathcal{R}$ (original relevant documents), $\mathcal{D}$ (original distractors), and $\mathcal{P}$ (paraphrased variants). This constrains the model to maximize true quality signals while actively ignoring paraphrase-level similarity that would otherwise compromise diversity.
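The shape of such a training instance can be sketched as a small record; the field names and example texts below are illustrative assumptions, not the paper's schema:

```python
from dataclasses import dataclass

@dataclass
class TrainingInstance:
    """One combinatorial training example (field names are illustrative)."""
    query: str
    relevant: list          # R: gold supporting documents
    distractors: list       # D: original distractor documents
    paraphrases: list       # P: LLM-generated paraphrases of documents in R

inst = TrainingInstance(
    query="Which city hosted the 1936 Summer Olympics?",
    relevant=["The 1936 Summer Olympics were held in Berlin."],
    distractors=["The 1936 Winter Olympics were held in Garmisch-Partenkirchen."],
    paraphrases=["Berlin was the host city of the Summer Games in 1936."],
)
```

During training, the paraphrases act as hard negatives for the diversity loss: they are semantically close to the relevant documents yet must not be rewarded as additional quality signal.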
5. Exemplar Selection and In-Context Generation
At inference, the retrieval model selects $k$ in-context exemplars for a test query $q$ using SMI-based scoring. The key formulas are:
- Generation conditioning:
$$\hat{y} = \mathrm{LLM}_{\theta}\!\left(\mathcal{T}(\mathcal{E},\, q)\right),$$
where $\hat{y}$ is the LLM output, $\mathcal{E}$ are the selected exemplars, $\mathcal{T}$ is a templating function, and $\theta$ are the learned parameters.
- Selection via SMI:
$$\mathcal{E}(q) = \operatorname*{arg\,max}_{\mathcal{A} \subseteq \mathcal{V},\ |\mathcal{A}| \le k} I_f(\mathcal{A};\, q).$$
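The templating step that renders the selected exemplars together with the test query can be sketched as simple demonstration concatenation; the Q:/A: layout here is an assumption, not the paper's actual prompt format:

```python
def build_prompt(exemplars, query):
    """Render selected (question, answer) exemplars followed by the test query.

    Stands in for the templating function that conditions the LLM; the
    Q:/A: format is illustrative only.
    """
    demos = "".join(f"Q: {q}\nA: {a}\n\n" for q, a in exemplars)
    return demos + f"Q: {query}\nA:"
```

The LLM then completes the text after the final "A:", so ordering of exemplars in the prompt is exactly where the framework's order-aware selection matters.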
6. Experimental Validation and Results
On nine benchmarks (classification, multi-choice, and generative QA), InSQuAD-RETRIEVE plus InSQuAD-LEARN achieves:
- Up to 21.6% improvement on classification tasks
- 16.4% gains on multi-choice tasks
- Up to 7% improvement on generative ICL
Ablation studies demonstrate reduced inference time compared to iterative or confidence-based selection strategies. The approach produces superior retrieval sets with respect to the joint quality-diversity objective, demonstrating practical efficacy for academic and commercial LLM deployment.
7. Implications and Significance
By enforcing both quality and diversity in in-context example selection through submodular mutual information, InSQuAD improves generalization, robustness, and efficiency in ICL workflows. Its likelihood-based combinatorial training ensures that retrieval models move beyond nearest-neighbor heuristics, capturing complex relationships needed for compositional multi-task reasoning in modern QA systems. Synthetic paraphrase augmentation makes the approach viable even in data-sparse regimes by preventing spurious overlap. The framework is modular, permitting extension to other domains with pool-based selection and paraphrase augmentation.
A plausible implication is that future benchmarks (such as those targeting procedural guidance or multi-document conversational QA (Wu et al., 1 Oct 2024, Wu et al., 2023)) may adopt analogous SMI-based strategies to enforce comprehensive coverage and diversity in prompt construction and evaluation protocols.