
Context Association Tests (CATs)

Updated 10 April 2026
  • Context Association Tests (CATs) denote two methodologies that assess context utilization—in models and in test takers—via counterfactual perturbation and adaptive item selection, respectively.
  • The counterfactual attentiveness approach randomly replaces input components to reveal whether models rely on the full context, sharpening evaluation in NLP and vision tasks.
  • Computerized Adaptive Testing leverages psychometric models and contextual bandit techniques to select optimal items, improving test reliability and efficiency.

Context Association Tests (CATs) refer to two distinct families of methodologies in contemporary research: (1) the Counterfactual Attentiveness Test for evaluating attentiveness in paired-input tasks within NLP and vision, and (2) Computerized Adaptive Testing, which adapts item administration based on test taker ability using machine learning and psychometric models. Each system is rooted in rigorous statistical frameworks and addresses the measurement of context utilization—either by models or by human subjects—under conditions of distributional uncertainty or adaptive administration.

1. Counterfactual Attentiveness Test: Definition and Formalism

The Counterfactual Attentiveness Test (CAT) is a black-box, automated evaluation protocol for measuring whether a model’s prediction depends on all components of a paired input, or whether it is susceptible to spurious correlations by overly attending to a single part. For an input pair $x = (x_1, x_2)$ and a model $M$, the procedure generates counterfactual examples by randomly replacing one component ($x_1$) with its counterpart from another datum, producing new input pairs $(x_1', x_2)$. By observing whether the model’s prediction changes when presented with this unrelated pair, researchers quantitatively measure "attentiveness"—the sensitivity of predictions to full input information.

Formally, the CAT attentiveness score is:

$$\mathrm{CAT}(M) = \frac{1}{|D'|}\sum_{i\in D'} \frac{1}{K}\sum_{j=1}^{K} \mathbf{1}\big[M(x_{1i}, x_{2i}) \neq M(x'_{1i,j}, x_{2i})\big]$$

where $D'$ is the set of evaluation points on which $M$ does not return the default label $y_0$, and $K$ is the number of counterfactual replacements per example. The metric approaches 1.0 only when the model exhibits robust context usage, altering its output nearly every time a context piece is swapped for a random unrelated one (Elazar et al., 2023).
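The score can be computed directly from cached model predictions. A minimal sketch in Python, assuming `preds[i]` holds $M(x_{1i}, x_{2i})$ and `cf_preds[i]` holds the predictions on that example's counterfactual swaps (all names here are illustrative, not the authors' code):

```python
def cat_score(preds, cf_preds, default_label="neutral"):
    """Fraction of counterfactual swaps that flip the prediction,
    averaged over examples where the model did not predict y0."""
    flips, n = 0.0, 0
    for p, cfs in zip(preds, cf_preds):
        if p == default_label:          # example i is not in D'
            continue
        flips += sum(p != cf for cf in cfs) / len(cfs)
        n += 1
    return flips / n if n else 0.0
```

A model that reverts to the default label on every swapped pair scores 1.0; a model whose prediction never changes scores 0.0.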

2. Construction and Properties of Counterfactuals

CAT leverages randomized, automated generation of counterfactual pairs. The steps are:

  1. Identify each dataset instance $(x_1, x_2)$ for which $M(x_1, x_2) \neq y_0$.
  2. For each such instance, sample $K$ alternative components $x_1'$ from other examples within the data split.
  3. Create counterfactual examples $(x_1', x_2)$ for all replacements.
  4. Compute the frequency with which $M$'s prediction changes.
  5. Derive the aggregate CAT score as in the metric above.
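The steps above can be sketched as a single loop, assuming a black-box callable `model(x1, x2) -> label`; function and variable names are illustrative assumptions:

```python
import random

def generate_counterfactuals(pairs, model, default_label, k=5, seed=0):
    """For each pair in D', sample k unrelated x1' from the same split
    and record the model's predictions on the swapped pairs."""
    rng = random.Random(seed)
    all_x1 = [x1 for x1, _ in pairs]
    results = []  # (original_pred, [counterfactual_preds]) for i in D'
    for i, (x1, x2) in enumerate(pairs):
        pred = model(x1, x2)
        if pred == default_label:              # keep only D'
            continue
        # sample k unrelated first components from other examples
        donors = [all_x1[j] for j in range(len(pairs)) if j != i]
        swaps = rng.sample(donors, min(k, len(donors)))
        results.append((pred, [model(x1p, x2) for x1p in swaps]))
    return results
```

The output feeds directly into the aggregate score defined in the metric above.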

Empirical validation shows that these swaps generate genuinely "unrelated" input pairs in nearly all cases (92–100% across the studied datasets)—meaning that a model with correct context dependency should revert to a default/neutral prediction under such perturbation. Manual inspection confirmed that this assumption holds broadly, with exceptions requiring additional verification (e.g., in SNLI) (Elazar et al., 2023).

3. Empirical Applications in NLP and Multimodal Tasks

CAT has been extensively evaluated across:

  • Natural Language Inference (NLI): RTE, MNLI, WANLI
  • Paraphrase Detection (PD): QQP, PAWS
  • Reading Comprehension (RC): SQuAD 2.0, DuoRC, NewsQA
  • Visual + Language Reasoning: VQA 2.0, NLVR2

Model families investigated comprise fine-tuned transformers (BERT, RoBERTa, DeBERTa, T5, Flan-T5), multimodal encoders (BLIP, ViLT), and large-scale in-context learning models (text-davinci-003 / GPT-3 and open-source Flan-T5).

Key results include:

  • Low partial-input correlations in the data lead to high CAT attentiveness (approaching 100%).
  • High partial-input correlations do not automatically imply model inattentiveness; for instance, SQuAD 2.0 exhibits high partial-input correlation, yet models remain highly attentive (about 98–99% CAT).
  • In NLI, notably MNLI, substantial label–hypothesis correlation lowers CAT scores to 47–82%, exposing model inattentiveness on some examples.
  • In-context learning (ICL) with many demonstrations increases test accuracy but may reduce CAT scores, indicating overreliance on demonstration patterns (Elazar et al., 2023).

4. Robustness and Counterfactual Data Augmentation

CAT-style counterfactuals are computationally inexpensive due to their synthetic nature. The same process is used for training augmentation by injecting explicit "neutral" or no-relation examples during training or prompt construction:

  • Supervised augmentation: For each non-default training example, add a counterfactual with the unrelated component, labeled as default.
  • ICL augmentation: Include analogous counterfactuals within demonstration sets, labeling them as default.
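The supervised variant can be sketched as follows, assuming training data as `((x1, x2), label)` tuples; the function name and sampling scheme are illustrative assumptions:

```python
import random

def augment_with_counterfactuals(examples, default_label, seed=0):
    """For each non-default example, add a counterfactual that pairs its
    second component with an unrelated first component, labeled default."""
    rng = random.Random(seed)
    x1_pool = [x1 for (x1, _), _ in examples]
    augmented = list(examples)
    for (x1, x2), label in examples:
        if label == default_label:
            continue
        # draw an unrelated first component from elsewhere in the split
        x1_prime = rng.choice([c for c in x1_pool if c != x1])
        augmented.append(((x1_prime, x2), default_label))
    return augmented
```

The ICL variant is analogous: the same synthetic default-labeled pairs are interleaved into the demonstration set rather than the training split.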

Empirical results show substantial increases in model attentiveness (e.g., T5-large on MNLI: 71.5% → 99.7% CAT after augmentation) with no loss of held-out accuracy, affirming the utility of this approach for mitigating shortcutting behavior. GPT-3 demonstrated a 10–15 point gain in attentiveness at slight cost to raw accuracy, suggesting the approach corrects conditional usage without universally harming predictive power (Elazar et al., 2023).

5. Best Practices, Limitations, and Interpretive Considerations

CAT provides a causal evaluation lens for paired-input tasks, supplementing conventional accuracy metrics—especially in domains prone to spurious correlations or prompt contamination. Best practices include verifying that randomized swaps reliably result in default/no-relation labels for each dataset and explicitly reporting CAT alongside accuracy.

A high CAT score is not a complete certificate of model robustness, as it does not guarantee protection against all adversarial strategies (e.g., lexical overlap adversaries). However, its diagnostic specificity for context dependence makes it a vital tool for both model evaluation and intervention.

6. Computerized Adaptive Testing and Contextual Bandit Frameworks

In a separate thread under the “CAT” acronym, Computerized Adaptive Testing leverages psychometric models—primarily Item Response Theory (IRT)—to adaptively select assessment items maximizing informativeness about latent test taker ability. Recent advances integrate automated item calibration (AutoIRT) and contextual bandit-style administration (BanditCAT):

  • AutoIRT: Learns item parameters (e.g., IRT difficulty and discrimination) from item features using AutoML (e.g., BERT embeddings, tabular features) and non-parametric probability prediction, then projects this onto IRT parameter space with least-squares fitting.
  • BanditCAT: Casts adaptive test administration as a contextual bandit problem where the reward for each item is its Fisher information at the current posterior over ability. Thompson sampling and parameter noise injection balance exploration (novel items) and exploitation (highly informative items), while controlling for item exposure.
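The selection step can be illustrated with a toy sketch under a standard 2PL IRT model, where item information at ability $\theta$ is $a^2 p(\theta)(1-p(\theta))$. The Gaussian posterior sample stands in for Thompson-style exploration; the parameter names and update scheme are simplifying assumptions, not the published algorithm:

```python
import math
import random

def p_correct(theta, a, b):
    """2PL probability of a correct response (a: discrimination, b: difficulty)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def fisher_info(theta, a, b):
    """2PL item information at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def select_item(items, theta_mean, theta_sd, rng):
    """Pick the item maximizing information at a sampled ability
    (Thompson-style: sampling from the posterior drives exploration)."""
    theta = rng.gauss(theta_mean, theta_sd)
    return max(items, key=lambda ab: fisher_info(theta, *ab))
```

Each administered response would then update the posterior over $\theta$, shrinking `theta_sd` and progressively concentrating selection on the most informative items.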

Practical deployments on Duolingo English Test vocabulary items demonstrate that the system can deliver high test reliability, efficient coverage, and low item exposure, achieving reliability on par with fixed-form tests using fewer items due to optimal adaptive selection (Sharpnack et al., 2024).

7. Summary Table: CAT Methodology and Scope

| Domain | Principle | Representative Reference |
|---|---|---|
| Counterfactual Attentiveness | Input perturbation to assess model context usage | (Elazar et al., 2023) |
| Computerized Adaptive Testing | Adaptive item selection via IRT, AutoML, and bandit methods | (Sharpnack et al., 2024) |

The term “CAT” consequently encapsulates both counterfactual-based diagnostic tools in machine learning and adaptive psychometric testing frameworks, each undergirded by context sensitivity: the former measures it in learned models, the latter exploits it for efficient human ability estimation.
