Selective Retrieval-Augmentation (SRA)

Updated 31 August 2025
  • Selective Retrieval-Augmentation (SRA) is a data-centric methodology that augments underrepresented classes with semantically retrieved examples to address long-tail imbalances.
  • It employs a two-stage retrieval process using TF-IDF for candidate selection and SBERT for precise re-ranking without changing the underlying model architecture.
  • Empirical results on legal datasets like LEDGAR and UNFAIR-ToS demonstrate that SRA significantly boosts macro-F1 scores, especially for rare classes.

Selective Retrieval-Augmentation (SRA) is a data-centric methodology that targets the augmentation of neural models with selectively retrieved external information, optimizing for both efficiency and predictive performance. Unlike indiscriminate retrieval-based approaches, SRA introduces domain-, class-, or sample-dependent selectivity criteria to control when and how additional evidence is introduced during learning or inference. This selectivity is particularly effective in scenarios with skewed data distributions, such as long-tail legal text classification, where most classes are underrepresented and conventional learning approaches tend to overfit or underperform on rare classes (Mao, 27 Aug 2025).

1. Rationale and Motivation

Long-tail label distributions plague many real-world classification tasks by producing models that perform well on frequent (head) classes while neglecting rare (tail) ones. SRA directly addresses this by augmenting only low-frequency or underrepresented classes: for these, semantically similar examples are retrieved from the training data itself and incorporated into the model’s input pipeline. For common classes, no retrieval is performed, sidestepping the introduction of spurious or redundant information. In the legal domain, this selectivity both mitigates potential noise injection for well-represented classes and prevents information leakage by prohibiting retrieval from external corpora. The approach does not require architectural changes to the underlying model, enabling easy integration with existing transformer-based architectures such as RoBERTa-base (Mao, 27 Aug 2025).

2. Methodological Framework

SRA consists of two core stages:

a. Identification of Target Classes for Augmentation

Given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^N$ with label set $\mathcal{C}$ and per-label frequencies $f(c)$ for $c \in \mathcal{C}$, SRA first ranks all classes in ascending order of frequency. A fixed cutoff ratio $\alpha$ (e.g., the bottom 65% of classes) is applied:

$$\mathcal{C}_{\text{low}} = \{ c_{(1)}, \ldots, c_{(k)} \}, \quad k = \lfloor \alpha |\mathcal{C}| \rfloor$$

Only samples $x$ with label $y \in \mathcal{C}_{\text{low}}$ are augmented.
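
As a concrete illustration, here is a minimal Python sketch of this cutoff step (the helper name and toy labels are illustrative, not from the paper):

```python
from collections import Counter

def low_frequency_classes(labels, alpha=0.65):
    """Return the bottom-alpha fraction of classes, ranked by training frequency."""
    freq = Counter(labels)               # per-label frequencies f(c)
    ranked = sorted(freq, key=freq.get)  # classes in ascending frequency order
    k = int(alpha * len(ranked))         # k = floor(alpha * |C|)
    return set(ranked[:k])               # C_low: only these labels get augmented

train_labels = ["notices", "notices", "notices", "governing_law", "indemnity"]
c_low = low_frequency_classes(train_labels)  # {"governing_law"} with this toy data
```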

b. Selective Retrieval and Augmentation Process

For each eligible sample, SRA executes a two-stage retrieval:

  • Stage 1: TF-IDF ranking identifies the top-20 preliminary candidates from the training set.
  • Stage 2: Candidates are re-ranked using SBERT cosine similarity, and the top-$k$ example(s) are selected (with $k=1$ by default).

The final augmented input is:

$$x' = p_{\text{orig}} \oplus x \oplus [\text{SEP}] \oplus p_{\text{ref}} \oplus r_1 \oplus \ldots \oplus r_k$$

where $p_{\text{orig}}$ and $p_{\text{ref}}$ are prompts (e.g., “Original clause:”, “Related clause for reference:”), the $r_j$ are truncated (max 64 tokens) retrieved clauses, and $\oplus$ denotes string concatenation.

This process is applied at train, dev, and test phases, always limiting the retrieval source to the training set.
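
The sketch below ties both retrieval stages and the input construction together, assuming scikit-learn and sentence-transformers; the specific SBERT checkpoint (`all-MiniLM-L6-v2`) and the toy corpus are assumptions for illustration, not details fixed by the paper:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel
from sentence_transformers import SentenceTransformer, util

# Retrieval is always restricted to the training set (no external corpora).
train_texts = [
    "Governing law: this Agreement is governed by the laws of Delaware.",
    "This Agreement shall be construed under the laws of New York.",
    "Each party shall indemnify the other against third-party claims.",
]

tfidf = TfidfVectorizer().fit(train_texts)
train_vecs = tfidf.transform(train_texts)
sbert = SentenceTransformer("all-MiniLM-L6-v2")  # assumed SBERT checkpoint

def sra_augment(x, k=1, n_candidates=20, max_ref_tokens=64):
    # Stage 1: TF-IDF ranking over the training set -> preliminary candidates
    # (the query itself is excluded so a training sample never retrieves itself).
    scores = linear_kernel(tfidf.transform([x]), train_vecs).ravel()
    order = scores.argsort()[::-1]
    cands = [train_texts[i] for i in order if train_texts[i] != x][:n_candidates]

    # Stage 2: SBERT cosine re-ranking -> keep the top-k (k=1 by default).
    sims = util.cos_sim(sbert.encode(x), sbert.encode(cands))[0]
    idx = sims.argsort(descending=True)[:k].tolist()
    refs = [cands[i] for i in idx]

    # Truncate each retrieved clause to 64 tokens (whitespace approximation).
    refs = [" ".join(r.split()[:max_ref_tokens]) for r in refs]

    # x' = p_orig + x + [SEP] + p_ref + r_1 + ... + r_k
    return ("Original clause: " + x + " [SEP] "
            + "Related clause for reference: " + " ".join(refs))

x_aug = sra_augment("The laws of Delaware shall govern this Agreement.")
```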

3. Model Implementation

SRA operates as a wrapper at the data-processing level and is agnostic to the downstream model. In all reported experiments, the classification architecture itself is unchanged: RoBERTa-base encodes the concatenated input $x'$, and predictions are made via a linear classification head:

$$h = \text{RoBERTa}(x'), \qquad \hat{y} = \text{softmax}(W h + b)$$

Truncation after augmentation keeps all sequences within the preset token limit (512 tokens). No model parameters or optimization routines change; SRA is a pure augmentation strategy.
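
A sketch of this setup with Hugging Face Transformers (the checkpoint name is as reported; the class count and input string are placeholders):

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

num_classes = 100  # |C| for the dataset at hand (placeholder)
tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=num_classes
)

x_aug = "Original clause: ... [SEP] Related clause for reference: ..."

# Truncation after augmentation enforces the 512-token limit.
enc = tokenizer(x_aug, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits       # classification head over h = RoBERTa(x')
probs = torch.softmax(logits, dim=-1)  # y_hat = softmax(W h + b)
```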

The retrieval pipeline uses classical IR tools:

  • First-pass selection: TF-IDF (or similar sparse vector-space rankers such as BM25).
  • Second-pass refinement: SBERT for fine-grained semantic similarity.

4. Empirical Evaluation

Experiments were performed on two benchmark legal datasets with pronounced long-tail characteristics:

| Dataset | Task Type | Challenge |
|---|---|---|
| LEDGAR | Single-label contract clause classification | Severe underrepresentation of labels |
| UNFAIR-ToS | Multi-label unfairness detection in ToS clauses | Dominant empty class (~89%) |

Performance was measured using both micro-F1 and macro-F1. Macro-F1, being more sensitive to rare classes, is the metric of primary importance for assessing SRA’s effectiveness.
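
A toy scikit-learn example makes the distinction concrete: a single missed tail-class example barely moves micro-F1 but sharply lowers macro-F1.

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 2]  # class 0 is a head class; 1 and 2 are tail classes
y_pred = [0, 0, 0, 0, 1, 0]  # the single class-2 example is misclassified

print(f1_score(y_true, y_pred, average="micro"))  # ~0.83, dominated by the head class
print(f1_score(y_true, y_pred, average="macro"))  # ~0.63, penalized by the missed tail class
```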

Key results include:

  • SRA with a 65% augmentation cutoff on LEDGAR achieves micro-F1 0.931, macro-F1 0.887, improving by ~+0.06 in macro-F1 over the RoBERTa-base baseline and surpassing RoBERTa-large (macro-F1 0.862).
  • On UNFAIR-ToS, augmenting only non-empty label cases yields micro-F1 0.988, macro-F1 0.924, outperforming both RoBERTa-base and Legal-BERT baselines.
  • Full (non-selective) retrieval augmentation can degrade performance—macro-F1 on LEDGAR fell to 0.816—due to noise from augmenting head classes.

Ablation studies established that the optimal cutoff ratio varies by dataset, with 55–65% achieving best results on LEDGAR. Statistical validation (bootstrap CIs, McNemar’s test) confirmed the significant gains in macro-F1.
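
As a sketch of how such checks can be run on paired predictions (with synthetic data standing in for the actual model outputs, so nothing here reproduces the paper's numbers):

```python
import numpy as np
from sklearn.metrics import f1_score
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
n = 500
y_true = rng.integers(0, 5, n)  # synthetic 5-class test labels
baseline = np.where(rng.random(n) < 0.80, y_true, rng.integers(0, 5, n))
sra = np.where(rng.random(n) < 0.88, y_true, rng.integers(0, 5, n))

# McNemar's test on paired per-example correctness (baseline vs. SRA).
b, s = baseline == y_true, sra == y_true
table = [[int((b & s).sum()), int((b & ~s).sum())],
         [int((~b & s).sum()), int((~b & ~s).sum())]]
print(mcnemar(table, exact=False).pvalue)

# Percentile bootstrap CI for the SRA macro-F1 (resample test indices).
boot = [f1_score(y_true[i], sra[i], average="macro")
        for i in (rng.integers(0, n, n) for _ in range(1000))]
print(np.percentile(boot, [2.5, 97.5]))
```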

| Dataset | Baseline (Micro/Macro) | SRA (Best, Micro/Macro) |
|---|---|---|
| LEDGAR | 0.879 / 0.827 | 0.931 / 0.887 |
| UNFAIR-ToS | 0.952 / 0.807 | 0.988 / 0.924 |

5. Analysis: Selectivity and Long-Tail Benefits

Selective augmentation is particularly beneficial for rare and medium-frequency classes, as shown in bucketed performance analysis. SRA improves accuracy not just for the targeted tail but also yields moderate gains for head classes, which is attributed to improved representation of minority contexts during training. On the other hand, universal (non-selective) augmentation actively harms head classes by introducing spurious context, confirming the necessity of precise selectivity criteria.

The retrieval source being restricted to training data ensures no information leakage across train-dev-test splits, a critical concern in legal and sensitive domains.

6. Limitations and Future Directions

Several aspects define the boundary conditions and next steps for SRA:

  • The optimal cutoff $\alpha$ must be tuned per dataset and may change with the underlying label distribution.
  • Retrieval and re-ranking are currently limited to the TF-IDF + SBERT pipeline; more advanced (e.g., dense neural retrievers, graph-aware retrieval) techniques could further enhance augmentation quality.
  • While training, validation, and test time augmentation ensure robustness, scaling to larger datasets could require efficiency optimizations in retrieval and sequence construction.
  • Extending SRA principles to domains beyond legal NLP, especially for other structured or weakly labeled long-tail datasets, is a promising direction.

7. Significance and Broader Implications

SRA provides a principled, architecture-neutral solution to long-tail imbalance, leveraging intra-domain retrieval for targeted class augmentation. Its clear improvements in macro-F1, especially for rare classes, underscore the importance of conditional augmentation policies over unselective approaches. The methodology is broadly applicable to any scenario where class skew presents a barrier to model generalization, and its implementation simplicity positions it as a strong baseline and potential building block for more sophisticated SRA methods.

This approach complements evolving lines of work in selective augmentation and retrieval-driven LLMs by offering an in-domain, resource-efficient, and information-leakage-safe paradigm with demonstrated empirical gains (Mao, 27 Aug 2025).
