
InSQuAD: Exemplar Selection Framework

Updated 29 August 2025
  • InSQuAD is a framework that employs submodular mutual information to ensure the selected exemplars are both relevant and diverse for effective in-context learning.
  • The approach uses a combinatorial training paradigm with a likelihood-based loss to optimize the balance between quality and diversity.
  • Empirical results across nine benchmarks show significant improvements on classification, multi-choice, and generative QA tasks, while also reducing inference time.

InSQuAD is a framework for exemplar selection in In-Context Learning (ICL) that enforces both quality (relevance) and diversity among in-context examples using Submodular Mutual Information (SMI) functions and a combinatorial training paradigm. Developed to address a limitation of traditional retrieval methods, which model query relevance at the expense of diversity, InSQuAD casts exemplar selection as a targeted submodular maximization problem and trains a dedicated retrieval model via a likelihood-based loss over SMI. The approach is validated empirically across nine benchmark datasets, demonstrating substantial gains over relevance-only baselines, reduced inference time through efficient combinatorial selection, and stronger training via dataset augmentation with paraphrases.

1. Motivation and Problem Formulation

The premise of InSQuAD is that effective ICL requires selecting in-context exemplars that are not merely relevant to the test query, but also collectively diverse and non-redundant. Existing retrieval approaches predominantly optimize for quality—gathering examples nearest to the query in embedding space—yet ignore the combinatorial structure that arises when exemplars overlap semantically or syntactically. InSQuAD targets three properties: quality, diversity, and order.

To formalize, InSQuAD frames the selection as:

  • Exemplar Annotation: Constructing a diverse subset from an unlabeled pool to represent the annotation distribution.
  • Exemplar Retrieval: Given a query $q_{\text{test}}$, selecting the top-$k$ in-context examples $\mathcal{C}$ that maximize both query similarity and non-overlap.

This strategy ensures that the chosen set $\mathcal{C}$ maximizes information with respect to the query while minimizing redundancy among selected exemplars, which is crucial for prompting LLMs in multi-hop or reasoning-intensive QA.

2. Submodular Mutual Information (SMI) Functions

InSQuAD uses SMI functions to balance relevance and diversity in selection:

  • Quality: Quantified by the mutual information between the exemplar set and the query, $I_f(\mathcal{C}; q_{\text{test}})$.
  • Diversity: Enforced via submodular functions, which reward incremental “coverage” and penalize redundancy.

Formally,

$$\mathcal{C} \leftarrow \underset{\mathcal{C} \subseteq V_{\text{labeled}},\; |\mathcal{C}| \leq k}{\arg\max}\; I_f(\mathcal{C}; q_{\text{test}})$$

where $V_{\text{labeled}}$ is the pool of candidate exemplars and $I_f$ is the SMI function. Submodularity ensures that greedy selection yields near-optimal solutions efficiently, capturing both incremental query relevance and pairwise diversity.
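Because $I_f$ is submodular, the arg-max can be approximated with a simple greedy loop over marginal gains. Below is a minimal NumPy sketch, assuming a facility-location-style SMI over a precomputed similarity matrix `sim` whose rows and columns index the pool together with the query; the paper's exact SMI instantiation and encoder are not reproduced here.

```python
import numpy as np

def fl_mi(sim, A, Q, eta=1.0):
    """Facility-location SMI: I_f(A; Q) = sum_i min(max_{a in A} sim[i, a],
    eta * max_{q in Q} sim[i, q]). One standard SMI instantiation, used here
    as an illustrative stand-in for the paper's I_f."""
    if not A:
        return 0.0
    cov_A = sim[:, A].max(axis=1)        # how well A covers each pool item
    cov_Q = eta * sim[:, Q].max(axis=1)  # how well the query covers it
    return float(np.minimum(cov_A, cov_Q).sum())

def greedy_smi_select(sim, query_ids, k, eta=1.0):
    """Greedy arg-max of I_f(C; q_test) subject to |C| <= k; submodularity
    gives this loop a (1 - 1/e) approximation guarantee."""
    selected, current = [], 0.0
    pool = set(range(sim.shape[0])) - set(query_ids)
    for _ in range(k):
        gains = {j: fl_mi(sim, selected + [j], query_ids, eta) - current
                 for j in pool}
        best = max(gains, key=gains.get)
        selected.append(best)
        pool.remove(best)
        current += gains[best]
    return selected
```

In practice a lazy-greedy (priority-queue) variant avoids rescoring every remaining candidate in each round, which keeps selection fast at inference time.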

During annotation,

$$V_{\text{shortlisted}} \leftarrow \underset{V_{\text{shortlisted}} \subseteq V,\; |V_{\text{shortlisted}}| \leq B}{\arg\max}\; f(V_{\text{shortlisted}})$$

where $B$ is the annotation budget and $f$ is a submodular function scored against the full pool $V$, rewarding shortlists that cover it diversely.
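The annotation stage admits the same greedy treatment with a plain facility-location objective; a brief sketch under the same `sim` assumption as above:

```python
def facility_location(sim, A):
    """f(A) = sum_i max_{a in A} sim[i, a]: coverage of the full pool, so
    maximizing it under the budget B favors a diverse shortlist."""
    return float(sim[:, A].max(axis=1).sum()) if A else 0.0

def greedy_shortlist(sim, B):
    """Greedy arg-max of f(V_shortlisted) subject to |V_shortlisted| <= B."""
    selected, pool = [], set(range(sim.shape[0]))
    for _ in range(B):
        best = max(pool, key=lambda j: facility_location(sim, selected + [j]))
        selected.append(best)
        pool.remove(best)
    return selected
```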

3. Combinatorial Training and Likelihood-Based Loss

To prevent the retrieval model from overfitting to query similarity alone, InSQuAD introduces a combinatorial training protocol (InSQuAD-LEARN) that adapts SMI parameters through a likelihood-based loss derived from Submodular Point Processes (SPPs).

Given a query $Q$, a set of relevant documents $S^+$, and a set of distractor documents $S^-$, the probability of choosing a set $S$ is:

$$P_\theta^Q(S) = \frac{I_f(S; Q)}{\sum_{S' \subseteq V} I_f(S'; Q)}$$

The ratio for relevant over distractor sets is:

$$\alpha_Q^\theta = \frac{I_f(S^+; Q)}{I_f(S^-; Q)}$$

This yields the negative log-likelihood:

$$L = -\log(\alpha_Q^\theta) = \log I_f(S^-; Q) - \log I_f(S^+; Q)$$

The overall joint loss—including diversity enforced by paraphrastic augmentations—is:

$$L_{\text{InSQuAD}} = \exp\big((1 - \lambda) L_q + \lambda L_d\big)$$

where $L_q$ (quality loss) and $L_d$ (diversity loss) compare the information overlap between the query, the relevant set, and the paraphrased distractor sets, and $\lambda$ weights their relative importance.
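A minimal PyTorch sketch of this training signal, assuming a trainable encoder `enc` mapping a list of strings to an `(n, d)` tensor, and using a graph-cut-style SMI as an illustrative differentiable stand-in for the paper's $I_f$:

```python
import torch
import torch.nn.functional as F

def gc_mi(emb_set, emb_query, lam=1.0):
    """Graph-cut SMI: I_f(S; Q) = 2 * lam * sum_{s in S, q in Q} cos(s, q);
    chosen for simplicity and differentiability, clamped so the log below
    stays defined."""
    sims = F.normalize(emb_set, dim=-1) @ F.normalize(emb_query, dim=-1).T
    return (2.0 * lam * sims.sum()).clamp_min(1e-6)

def nll_ratio(enc, query, s_pos, s_neg):
    """L = log I_f(S^-; Q) - log I_f(S^+; Q): the relevant set should carry
    more mutual information with the query than the distractor set."""
    q = enc([query])  # (1, d)
    return torch.log(gc_mi(enc(s_neg), q)) - torch.log(gc_mi(enc(s_pos), q))

def insquad_loss(enc, query, s_pos, s_neg, s_para, lam=0.5):
    """Joint objective L = exp((1 - lam) * L_q + lam * L_d). Treating the
    paraphrased set S^p as the distractor for L_d is an assumption about
    how the diversity term is instantiated."""
    l_q = nll_ratio(enc, query, s_pos, s_neg)
    l_d = nll_ratio(enc, query, s_pos, s_para)
    return torch.exp((1 - lam) * l_q + lam * l_d)
```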

4. Dataset Augmentation via Paraphrases

A unique component is paraphrase augmentation. Multi-hop QA datasets, such as HotpotQA, lack sufficient paraphrastic or distractor variants. InSQuAD addresses this by synthetically generating paraphrases for each supporting document using large models (e.g., GPT-3.5 Turbo). Training instances thus comprise $q$, $S^+$ (original relevant documents), $S^-$ (original distractors), and $S^p$ (paraphrased variants). This constrains the model to maximize true quality signals while actively ignoring paraphrase-level similarity that would otherwise compromise diversity.
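A sketch of how such training instances might be assembled; `paraphrase_fn` here is a hypothetical hook wrapping an LLM paraphrasing call (e.g., to GPT-3.5 Turbo), not an API from the paper:

```python
def build_instance(query, s_pos, s_neg, paraphrase_fn, n_para=2):
    """Assemble one training tuple (q, S^+, S^-, S^p); each supporting
    document contributes n_para synthetic paraphrases to S^p."""
    s_para = [p for doc in s_pos for p in paraphrase_fn(doc, n=n_para)]
    return {"q": query, "s_pos": s_pos, "s_neg": s_neg, "s_para": s_para}
```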

5. Exemplar Selection and In-Context Generation

At inference, the retrieval model $R(\cdot, \theta)$ produces in-context exemplars for a test query using SMI-based scoring. The major formulas are given below, followed by a short prompt-construction sketch:

  • Generation conditioning:

$$p(y_{\text{test}} \mid \mathcal{C}, q_{\text{test}}) = \mathcal{M}(V(y_{\text{test}}) \mid \mathcal{C}, T(q_{\text{test}}); \hat{\theta})$$

where $y_{\text{test}}$ is the LLM output, $\mathcal{C}$ is the set of selected exemplars, $T$ is a templating function, and $\hat{\theta}$ are the learned parameters.

  • Selection via SMI:

$$\mathcal{C} \leftarrow \underset{\mathcal{C} \subseteq V_{\text{labeled}},\; |\mathcal{C}| \leq k}{\arg\max}\; I_f(\mathcal{C}; q_{\text{test}})$$
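Tying the two steps together, a minimal prompt-construction sketch; the `Q:`/`A:` layout stands in for the paper's templating function $T$ and is an assumption:

```python
def build_icl_prompt(exemplars, q_test):
    """Render the selected exemplar set C followed by the templated test
    query T(q_test); the LLM M then generates y_test from this prompt."""
    shots = "\n\n".join(
        f"Q: {ex['question']}\nA: {ex['answer']}" for ex in exemplars
    )
    return f"{shots}\n\nQ: {q_test}\nA:"

# Usage: keep the order produced by greedy selection, then call the LLM.
# prompt = build_icl_prompt([labeled_pool[i] for i in selected], q_test)
```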

6. Experimental Validation and Results

On nine benchmarks (classification, multi-choice, and generative QA), InSQuAD-RETRIEVE plus InSQuAD-LEARN achieves:

  • Up to 21.6% improvement on classification tasks
  • 16.4% gains on multi-choice tasks
  • Up to 7% improvement on generative ICL

Ablation studies demonstrate reduced inference time compared to iterative or confidence-based selection strategies. The approach produces superior retrieval sets with respect to the joint quality-diversity objective, demonstrating practical efficacy for academic and commercial LLM deployment.

7. Implications and Significance

By enforcing both quality and diversity in in-context example selection through submodular mutual information, InSQuAD improves generalization, robustness, and efficiency in ICL workflows. Its likelihood-based combinatorial training ensures that retrieval models move beyond nearest-neighbor heuristics, capturing complex relationships needed for compositional multi-task reasoning in modern QA systems. Synthetic paraphrase augmentation makes the approach viable even in data-sparse regimes by preventing spurious overlap. The framework is modular, permitting extension to other domains with pool-based selection and paraphrase augmentation.

A plausible implication is that future benchmarks (such as those targeting procedural guidance or multi-document conversational QA (Wu et al., 1 Oct 2024, Wu et al., 2023)) may adopt analogous SMI-based strategies to enforce comprehensive coverage and diversity in prompt construction and evaluation protocols.
