Contextualized Embedding Association Test

Updated 9 March 2026

CEAT is a statistical methodology that measures social and demographic biases by evaluating association scores at the token level in contextualized embeddings.
It generalizes traditional tests like WEAT by incorporating context dependence, using cosine similarity metrics and meta-analysis to aggregate bias effects.
CEAT’s integration with automated pipelines and RAG+LLM extraction frameworks enables scalable, reproducible audits of AI-generated content for nuanced intersectional biases.

The Contextualized Embedding Association Test (CEAT) is a statistical methodology for measuring and quantifying social, demographic, and intersectional biases in the contextualized embeddings produced by modern large language models (LLMs). Unlike earlier approaches such as the Word Embedding Association Test (WEAT), which operate on static word vectors, CEAT evaluates association strengths at the token level within natural or semantically modified sentences, providing a more faithful assessment of bias as it manifests in real-world language use and AI-generated content (Peng et al., 19 May 2025, Tan et al., 2019, Guo et al., 2020).

1. Foundations and Motivation

CEAT was motivated by limitations in previous bias-measurement techniques that relied on context-invariant word embeddings (e.g., word2vec, GloVe). As transformer-based LLMs (e.g., BERT, GPT-2, GPT-4o) produce token embeddings that are dependent on their sentence context, bias in these models can only be meaningfully assessed at the contextualized embedding level. CEAT generalizes WEAT by computing differential association scores for groups (e.g., gender, race, nationality) based on how their token-in-context embeddings relate to desirable or undesirable attribute sets, reflecting social and systemic stereotypes observable in large-scale AI outputs (Guo et al., 2020, Tan et al., 2019).

2. Mathematical Formulation

Let $X$ and $Y$ be equally sized target sets (e.g., demographic groups), and $A$ and $B$ equally sized attribute sets.

Association Score for Individual Words (contexts):

$s(\mathbf{w},A,B) = \frac{1}{|A|}\sum_{a\in A}\cos\bigl(\mathbf{w},\mathbf{a}\bigr) - \frac{1}{|B|}\sum_{b\in B}\cos\bigl(\mathbf{w},\mathbf{b}\bigr)$

where embeddings are extracted for $\mathbf{w}$, $\mathbf{a}$, and $\mathbf{b}$ at their contextualized positions in sentences (Peng et al., 19 May 2025, Tan et al., 2019, Guo et al., 2020).
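The association score can be illustrated with a minimal sketch in plain Python. The 2-dimensional vectors below are toy stand-ins for contextualized embeddings, and the function names `cosine` and `association` are illustrative choices rather than part of any published implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    mean_a = sum(cosine(w, a) for a in A) / len(A)
    mean_b = sum(cosine(w, b) for b in B) / len(B)
    return mean_a - mean_b

# Toy vectors (assumption: real inputs would be token-in-context embeddings).
w = [1.0, 0.0]
A = [[1.0, 0.0], [1.0, 1.0]]
B = [[0.0, 1.0], [-1.0, 0.0]]
score = association(w, A, B)
print(round(score, 4))
```

A positive score means the target vector lies closer, on average, to the vectors in $A$ than to those in $B$.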

Effect Size (ES):

$ES(X,Y,A,B) = \frac{ \frac{1}{|X|}\sum_{x\in X}s(x,A,B) - \frac{1}{|Y|}\sum_{y\in Y}s(y,A,B) }{ \sigma_{w\in X\cup Y}[s(w,A,B)] }$

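Here $\sigma$ denotes the standard deviation of the association scores over all targets in $X\cup Y$. Larger $|ES|$ denotes stronger bias in contextualized associations; a positive $ES$ indicates that $X$ is more strongly associated with $A$ (relative to $B$) than $Y$ is.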
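As a rough sketch, the effect size can be computed from per-target association scores as follows. The use of the sample standard deviation (denominator $n-1$) is an assumption made here for illustration, and the score values are invented toy numbers.

```python
import statistics

def effect_size(scores_x, scores_y):
    """ES from per-target association scores s(x, A, B) and s(y, A, B)."""
    mean_diff = statistics.mean(scores_x) - statistics.mean(scores_y)
    # Assumption: sample standard deviation over the pooled targets of X and Y.
    pooled_sd = statistics.stdev(scores_x + scores_y)
    return mean_diff / pooled_sd

# Toy association scores (illustrative values only).
scores_x = [0.30, 0.25, 0.35]
scores_y = [0.05, 0.10, 0.00]
es = effect_size(scores_x, scores_y)
print(round(es, 3))
```

Swapping the two target sets flips the sign of the result, which reflects that the direction of the bias is encoded in the sign and its strength in the magnitude.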

Combined Effect Size (CES):

A random-effects meta-analytic aggregation is performed when bias is measured across multiple natural contexts:

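$CES(X,Y,A,B) = \frac{\sum_{i=1}^{N} v_i\,ES_i}{\sum_{i=1}^{N} v_i}$

with weights $v_i$ reflecting within- and between-context variance, where $ES_i$ is the effect size measured in the $i$-th of $N$ sampled contexts (Peng et al., 19 May 2025, Guo et al., 2020).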
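A random-effects combination of per-context effect sizes might be sketched as below. The between-context variance is estimated here with the DerSimonian-Laird method, a common choice in random-effects meta-analysis that is assumed for this sketch. The within-context variances are assumed to be supplied by the caller, and the function name `combined_effect_size` is hypothetical.

```python
import math

def combined_effect_size(effect_sizes, variances):
    """Weighted mean of effect sizes with weights 1 / (within + between variance)."""
    n = len(effect_sizes)
    w = [1.0 / v for v in variances]  # fixed-effect (within-only) weights
    fixed_mean = sum(wi * es for wi, es in zip(w, effect_sizes)) / sum(w)
    # Cochran's Q measures heterogeneity of effect sizes across contexts.
    q = sum(wi * (es - fixed_mean) ** 2 for wi, es in zip(w, effect_sizes))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    # Assumption: DerSimonian-Laird estimate of between-context variance.
    tau2 = max(0.0, (q - (n - 1)) / c) if n > 1 and c > 0 else 0.0
    v = [1.0 / (var + tau2) for var in variances]
    ces = sum(vi * es for vi, es in zip(v, effect_sizes)) / sum(v)
    standard_error = math.sqrt(1.0 / sum(v))
    return ces, standard_error

# Toy inputs: two contexts with equal within-context variance.
ces, se = combined_effect_size([0.2, 0.8], [0.1, 0.1])
print(round(ces, 3), round(se, 3))
```

When the per-context effect sizes disagree more than their within-context variances would predict, the between-context term grows and the weights move closer to a simple average.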

Permutation Test for Statistical Significance:

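Label permutations ($X_i$, $Y_i$), i.e., equal-size re-partitions of $X\cup Y$, are used to compute $p$-values:

$p = \Pr_i\bigl[\,s(X_i,Y_i,A,B) > s(X,Y,A,B)\,\bigr]$

where $s(X,Y,A,B) = \sum_{x\in X}s(x,A,B) - \sum_{y\in Y}s(y,A,B)$ is the test statistic, following the WEAT formulation.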
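For small target sets, a one-sided permutation test can be sketched by enumerating every equal-size re-partition of the pooled targets. This exhaustive enumeration is an assumption suited to toy sizes; larger sets would typically be handled by random sampling of partitions. The helper names are hypothetical.

```python
from itertools import combinations

def partition_statistic(pooled, idx_x):
    """Sum of scores assigned to X minus sum of scores assigned to Y."""
    in_x = set(idx_x)
    sum_x = sum(v for i, v in enumerate(pooled) if i in in_x)
    sum_y = sum(v for i, v in enumerate(pooled) if i not in in_x)
    return sum_x - sum_y

def permutation_p_value(scores_x, scores_y):
    """Fraction of re-partitions whose statistic strictly exceeds the observed one."""
    pooled = scores_x + scores_y
    n_x = len(scores_x)
    observed = partition_statistic(pooled, range(n_x))
    exceed = 0
    total = 0
    for idx in combinations(range(len(pooled)), n_x):
        exceed += partition_statistic(pooled, idx) > observed
        total += 1
    return exceed / total

# Toy association scores (illustrative values only).
scores_x = [0.30, 0.25, 0.35]
scores_y = [0.05, 0.10, 0.00]
p_value = permutation_p_value(scores_x, scores_y)
print(p_value)
```

With three targets per set there are only 20 partitions, so the smallest attainable non-zero $p$-value is 0.05; this illustrates why small target sets limit statistical power.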

3. Implementation Methodologies

CEAT can be applied either with hand-curated word sets and crafted sentence templates (Tan et al., 2019, Guo et al., 2020), or via automated extraction and real-world corpora.

Standard Pipeline:

Select target and attribute sets $X, Y, A, B$.
Assemble contexts:
- “Bleached” templates (“This is [ ].”) or naturally occurring sentences.
For each target and attribute word, extract contextualized embeddings for tokens positioned within these contexts.
Compute $s(\mathbf{w},A,B)$, $ES$, and, when assessing many natural contexts, aggregate into $CES$ using random-effects meta-analysis (Guo et al., 2020).
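The steps above can be sketched end to end with a pluggable embedding function. In this sketch `toy_embed` is a hypothetical stand-in for a real contextual encoder (it returns seeded random vectors, not model outputs), the word lists are illustrative, and only the first template comes from the text; the other two are invented. Applying one template to every word is also a simplification of sampling natural contexts.

```python
import math
import random
import statistics

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def association(w, A_vecs, B_vecs):
    return (sum(cosine(w, a) for a in A_vecs) / len(A_vecs)
            - sum(cosine(w, b) for b in B_vecs) / len(B_vecs))

def effect_size(sx, sy):
    # Assumption: sample standard deviation over the pooled targets.
    return (statistics.mean(sx) - statistics.mean(sy)) / statistics.stdev(sx + sy)

def toy_embed(word, sentence, dim=16):
    """Hypothetical encoder stand-in: word vector plus a context-dependent shift."""
    word_rng = random.Random(word)
    ctx_rng = random.Random(word + "|" + sentence)
    return [word_rng.gauss(0, 1) + 0.3 * ctx_rng.gauss(0, 1) for _ in range(dim)]

def per_context_effect_sizes(X, Y, A, B, templates, embed):
    """Return one effect size per context template."""
    sizes = []
    for template in templates:
        def vec(word):
            return embed(word, template.format(word))
        A_vecs = [vec(a) for a in A]
        B_vecs = [vec(b) for b in B]
        sx = [association(vec(x), A_vecs, B_vecs) for x in X]
        sy = [association(vec(y), A_vecs, B_vecs) for y in Y]
        sizes.append(effect_size(sx, sy))
    return sizes

templates = ["This is {}.", "That is {}.", "Here is {}."]
sizes = per_context_effect_sizes(
    X=["rose", "tulip", "daisy"], Y=["ant", "wasp", "moth"],
    A=["pleasant", "lovely"], B=["awful", "nasty"],
    templates=templates, embed=toy_embed,
)
print([round(s, 3) for s in sizes])
```

In practice the `embed` argument would be replaced by a function that runs a contextual model on the sentence and returns the hidden state at the target token's position; the resulting per-context effect sizes are then the inputs to the meta-analytic aggregation.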

Automated Pipeline in GenAI Content (RAG+LLM Extraction) (Peng et al., 19 May 2025):

AI-generated scripts are chunked; contextualized embeddings are extracted for target and attribute candidates.
Sets $X, Y, A, B$ are generated by prompting modern LLMs (e.g., GPT-4o) in a few-shot fashion to extract demographic group and attribute terms, which reduces the human subjectivity involved in manual set construction.
A Retrieval-Augmented Generation (RAG) framework retrieves relevant script chunks and supplies them to the LLM, which outputs candidate word sets for audit.
The pipeline computes $ES$, $CES$, and permutation $p$-values, reporting the extent and statistical reliability of observed bias.
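The chunking and retrieval steps of such a pipeline might look roughly like the sketch below. It is not the cited authors' implementation: retrieval here uses a simple bag-of-words cosine score as a stand-in for embedding-based retrieval, the LLM prompting step is omitted, and all function names, parameters, and the sample script are hypothetical.

```python
import math
from collections import Counter

def chunk_text(text, size=8, overlap=2):
    """Split text into overlapping windows of `size` words."""
    words = text.split()
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]

def bow_cosine(query, chunk):
    """Cosine similarity between lowercase word-count vectors."""
    q, c = Counter(query.lower().split()), Counter(chunk.lower().split())
    dot = sum(q[t] * c[t] for t in q)
    norm = math.sqrt(sum(n * n for n in q.values())) * math.sqrt(sum(n * n for n in c.values()))
    return dot / norm if norm else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda ch: bow_cosine(query, ch), reverse=True)[:k]

# Invented sample script for illustration.
script = ("the engineer designed a strong bridge for the city and the nurse "
          "comforted every anxious patient with great kindness")
chunks = chunk_text(script)
top = retrieve("engineer bridge", chunks, k=1)
print(top[0])
```

The retrieved chunks would then be passed, together with a few-shot prompt, to an LLM that proposes candidate target and attribute terms.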

Methodological Aspect	Classic CEAT	Automated CEAT (GenAI Content)
Target/Attribute Set Selection	Manual or Fixed Lexica	LLM-generated via prompt engineering
Contexts	Templates or Corpus	AI-generated document chunks
Statistical Aggregation	Meta-analysis	Meta-analysis
Human Subjectivity	Present	Mitigated

4. Applications and Empirical Results

CEAT is capable of auditing various forms of bias, including gender, racial, and intersectional group biases, within the contextualized outputs of neural LLMs. Its integration into RAG+LLM pipelines enables scalable, automated audits of AI-generated educational materials, compliance dashboards, and monitoring in domains such as news summarization, chatbots, and HR document drafting (Peng et al., 19 May 2025).

Empirical evaluations demonstrate:

High alignment between automated and manually curated target/attribute sets (cosine similarity 0.76–0.89).
Near-perfect agreement of CEAT scores between the automated and manual pipelines, measured by Pearson correlation on held-out texts (Peng et al., 19 May 2025).
Racial bias often exceeds gender bias in magnitude in contextual LLMs; intersectional (e.g., African-American female) groups exhibit the strongest negative attributions (Tan et al., 2019, Guo et al., 2020).
Biases detected at the token level may not be visible at the sentence-pooling level; 36.6% of significant biases are only observable through token-level CEAT (Tan et al., 2019).

5. Relation to WEAT, SEAT, and Other Bias Tests

WEAT: Measures static association in fixed word embeddings (e.g., word2vec, GloVe); context-free.
SEAT: Computes associations over pooled sentence representations (sentence encoders).
CEAT: Generalizes WEAT/SEAT to contextual embeddings at the token level, offering isolation from confounding sentence or document pooling effects and supporting nuanced, intersectional bias analyses.

Compared to template-based bias measures for contextualized word embeddings (CWEs), CEAT samples from distributions over many natural contexts and applies meta-analytic summarization, thus capturing the heterogeneity and context dependence of bias (Guo et al., 2020).

6. Limitations and Prospects

Reported limitations include:

Dependence on the quality and representativeness of contextual corpora or AI-generated content used for sampling. Sampling from domains with demographic skews (e.g., Reddit) may affect estimates (Guo et al., 2020).
Absence of significant $p$-values is not evidence of no bias; power and detection depend on context diversity, set construction, and sample size $N$ (Tan et al., 2019).
Scope is limited by the lexica or extraction pipeline; extensions to non-binary gender, other intersections (e.g., race $\times$ age), and richer attribute sets are necessary.
Sensitivity to prompts and retrieval accuracy in automated pipelines (Peng et al., 19 May 2025).
CEAT detects but does not mitigate bias and requires appropriate downstream interventions for debiasing.
Layerwise or meta-regression extensions could enable finer-grained analysis of when and where bias crystallizes in network architectures (Tan et al., 2019, Guo et al., 2020).

7. Significance and Impact

CEAT has enabled the assessment of algorithmic bias in neural LLMs in contexts aligned with their real-world use. Its token-in-context focus captures latent associations missed by sentence-level or static-bias measures and reveals amplified negative attributions toward intersectional identity groups. By integrating CEAT with prompt-based extraction and RAG systems, it is possible to automate large-scale, reproducible audits of AI-generated content for bias, with robust alignment to manual expert judgments (Peng et al., 19 May 2025).

CEAT provides a statistical framework for evidence-based AI ethics research and for fair and accountable LLM deployment. Its random-effects model over natural contexts summarizes the distribution of bias effects across contexts rather than a single point estimate, allowing for transparent, granular reporting in scientific and policy settings (Guo et al., 2020, Tan et al., 2019).