Idealized CAT (ICAT) Score
- Idealized CAT (ICAT) Score is a metric that combines language model fluency and neutrality, using language model score (lms) and stereotype score (ss) to provide a single interpretable value.
- A separate metric of the same name applies to binary classification: it is the precision at an empirically determined indistinguishability threshold, which makes it robust to issues like class imbalance.
- ICAT overcomes limitations of traditional metrics such as AUC and F1 by offering a mathematically rigorous evaluation for both language bias and classifier performance.
The name "Idealized CAT (ICAT) Score" refers to two formally unrelated metrics from distinct lines of research: one assesses stereotyping in LLMs using triplets of meaningful and irrelevant sentences, and the other evaluates binary classifiers via precision at a principled indistinguishability threshold. Each is designed to offer a principled, interpretable single-number summary of model performance. Despite their separate origins, both rest on explicit mathematical definitions, and each addresses limitations of alternative metrics (such as AUC or F1 in the classification case). This article presents both definitions.
1. ICAT for LLM Bias: Definition and Rationale
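In evaluating stereotypical bias in LLMs, the Idealized CAT Score (iCAT) is formulated to capture a tradeoff between language modeling competence and neutrality between stereotypical and anti-stereotypical completions (Pang et al., 2 Feb 2025). Given a test set $\{(s_i, a_i, u_i)\}_{i=1}^{N}$, where $s_i$ denotes a stereotypical, $a_i$ an anti-stereotypical, and $u_i$ an irrelevant sentence, the iCAT metric is built on two quantities:
- Language Model Score (lms):
$$\mathrm{lms} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\left[\max\{P(s_i), P(a_i)\} > P(u_i)\right]$$
where $P(\cdot)$ is the model (pseudo-)log-likelihood.
- Stereotype Score (ss):
$$\mathrm{ss} = \frac{100}{N} \sum_{i=1}^{N} \mathbb{1}\left[P(s_i) > P(a_i)\right]$$
The iCAT score itself is: $$\mathrm{iCAT} = \mathrm{lms} \cdot \frac{\min(\mathrm{ss},\, 100 - \mathrm{ss})}{50}$$ This yields a value in $[0, 100]$, maximized only when the model always assigns highest probability to a meaningful (either stereotyped or anti-stereotyped) completion and shows perfect neutrality (ss = 50). The symmetry inherent in $\min(\mathrm{ss},\, 100 - \mathrm{ss})$ ensures indifference between stereotyped and anti-stereotyped preferences is rewarded.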
This design simultaneously penalizes LLMs that are either biased (ss far from 50) or lack discriminative power (low lms), producing a single-number summary that reflects both criteria.
2. ICAT in Classification: Precision at the Indistinguishability Threshold
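For evaluating binary classifiers, the Idealized CAT Score (ICAT) is independently defined as precision at the "indistinguishability threshold" (Sumpter, 2023). For a classifier assigning real-valued scores to all instances, let $F_+$ be the empirical distribution of true positive scores and $F_-$ the distribution for true negatives.
- Indistinguishability Threshold $\tau^*$:
$\tau^*$ is the unique solution to:
$$B(\tau^*) = \tfrac{1}{2}$$
Formally,
$$B(\tau) = \Pr\left(X > Y_\tau\right)$$
with $X \sim F_+$ and $Y_\tau$ drawn from the scores of all instances scoring above $\tau$.
- ICAT Score (Precision at $\tau^*$):
$$\mathrm{ICAT} = \mathrm{Precision}(\tau^*) = \frac{TP(\tau^*)}{TP(\tau^*) + FP(\tau^*)}$$
where $TP(\tau)$ and $FP(\tau)$ count the true and false positives among instances scoring above $\tau$. That is, the precision when the threshold is set such that positively-labeled items are statistically indistinguishable from true positives in pairwise comparisons.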
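To see how the threshold and the precision interact, the pairwise comparison can be split by the true class of the predicted positive. The identity below is a sketch obtained from the law of total probability, not a formula taken from the source. It assumes $B(\tau)$ denotes the probability that a randomly chosen true positive outscores a randomly chosen instance scoring above $\tau$; the symbols $\pi(\tau)$, $X$, $Y^{+}$, and $Y^{-}$ are notation introduced here for illustration.

```latex
B(\tau) \;=\; \pi(\tau)\,\Pr\!\left(X > Y^{+} \mid Y^{+} > \tau\right)
        \;+\; \bigl(1 - \pi(\tau)\bigr)\,\Pr\!\left(X > Y^{-} \mid Y^{-} > \tau\right)
```

Here $\pi(\tau)$ is the precision at $\tau$, $X$ and $Y^{+}$ are independent draws from the true positive scores, and $Y^{-}$ is a draw from the true negative scores. With continuous scores and $\tau \to -\infty$, the first probability tends to $1/2$ and the second to the AUC, so $B$ starts above $1/2$ for any classifier with AUC above $1/2$. Raising $\tau$ restricts the comparison to higher-scoring instances and lowers $B$, which is why a crossing at $1/2$ can be located.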
3. Step-by-Step Computation Procedures
iCAT (LLM Bias)
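- For each triplet, compute likelihoods $P(s_i)$, $P(a_i)$, $P(u_i)$ of the stereotypical, anti-stereotypical, and irrelevant sentences.
- Compute lms: percentage of triplets where $\max\{P(s_i), P(a_i)\} > P(u_i)$.
- Compute ss: percentage of triplets where $P(s_i) > P(a_i)$.
- Compute iCAT as $\mathrm{lms} \cdot \min(\mathrm{ss},\, 100 - \mathrm{ss}) / 50$.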
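The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `icat_from_triplets` and the input layout (one tuple of stereotypical, anti-stereotypical, and irrelevant log-likelihoods per triplet) are assumptions, as is the choice to treat ties as "not preferred".

```python
from typing import Sequence, Tuple


def icat_from_triplets(
    scores: Sequence[Tuple[float, float, float]],
) -> Tuple[float, float, float]:
    """Return (lms, ss, icat) on a 0-100 scale.

    Each element of `scores` is a hypothetical
    (stereotypical, anti_stereotypical, irrelevant) log-likelihood triple.
    """
    if not scores:
        raise ValueError("at least one triplet is required")
    n = len(scores)
    # lms: share of triplets where a meaningful option beats the irrelevant one.
    meaningful = sum(1 for s, a, u in scores if max(s, a) > u)
    # ss: share of triplets where the stereotype beats the anti-stereotype.
    stereo = sum(1 for s, a, _ in scores if s > a)
    lms = 100.0 * meaningful / n
    ss = 100.0 * stereo / n
    icat = lms * min(ss, 100.0 - ss) / 50.0
    return lms, ss, icat


# Four made-up triplets: three prefer a meaningful option, two prefer the stereotype.
triplets = [(-1.0, -2.0, -3.0), (-2.0, -1.0, -3.0), (-1.0, -2.0, -5.0), (-4.0, -3.0, -1.0)]
print(icat_from_triplets(triplets))  # (75.0, 50.0, 75.0)
```

Strict inequalities mean that a tie between a meaningful and an irrelevant sentence counts against lms; a different tie rule would be an equally defensible choice.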
ICAT (Classification)
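- Sort unique classifier scores.
- For each candidate threshold $\tau$:
  - Compute $S_+(\tau)$ as the positive-label survival fraction above $\tau$.
  - Compute $S_-(\tau)$ for negatives.
  - Evaluate $B(\tau)$, the probability that a random true positive outscores a random instance scoring above $\tau$.
- Find $\tau^*$ where $B(\tau^*) = 0.5$ by interpolation.
- Compute precision at $\tau^*$ for the final ICAT score.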
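The procedure above can be sketched with NumPy as follows. This is an illustrative reading rather than the source's implementation. Three choices are assumptions: $B$ is estimated as the probability that a random true positive outscores a random instance at or above the threshold, ties count as half, and precision is linearly interpolated between the two candidate thresholds that bracket the target. The function name `icat_precision` and its `target` parameter are hypothetical.

```python
import numpy as np


def icat_precision(scores, labels, target=0.5):
    """Precision at the threshold where the estimated B crosses `target`."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = np.sort(scores[labels == 1])
    if pos.size == 0 or pos.size == labels.size:
        raise ValueError("need at least one positive and one negative")

    # g[j]: probability that a random true positive outscores item j (ties count half).
    right = np.searchsorted(pos, scores, side="right")
    left = np.searchsorted(pos, scores, side="left")
    g = ((pos.size - right) + 0.5 * (right - left)) / pos.size

    # Rank items from highest to lowest score.
    order = np.argsort(-scores, kind="stable")
    s, g, y = scores[order], g[order], labels[order]

    # Candidate cut-offs: label the top-k items positive, with k at boundaries
    # between distinct scores so tied items are never split.
    ks = np.append(np.flatnonzero(np.diff(s) != 0) + 1, s.size)
    b = np.cumsum(g)[ks - 1] / ks     # B at each cut-off (non-decreasing in k)
    prec = np.cumsum(y)[ks - 1] / ks  # precision at each cut-off

    above = np.flatnonzero(b >= target)
    if above.size == 0:
        raise ValueError("B never reaches the target for these scores")
    j = above[0]
    if j == 0 or b[j] == target:
        return float(prec[j])
    # Linear interpolation between the bracketing cut-offs.
    w = (target - b[j - 1]) / (b[j] - b[j - 1])
    return float(prec[j - 1] + w * (prec[j] - prec[j - 1]))


scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(icat_precision(scores, labels))          # 5/7, about 0.714
print(icat_precision(np.exp(scores), labels))  # same value: only the ordering matters
```

Because the estimate depends only on score ranks, any strictly increasing transformation of the scores (here `np.exp`) leaves the result unchanged.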
4. Interpretation and Numerical Behavior
For LLMs, iCAT values near 100 indicate both high fluency and neutrality (ss ≈ 50, lms ≈ 100). iCAT collapses to zero for models that are either always biased (ss near 0 or 100) or lack the ability to score meaningful completions above irrelevant ones (lms ≈ 0). Mid-range values reflect partial failures in either attribute.
For classification, ICAT tracks the fraction of predicted positives that are true positives at the threshold where predicted positives are "statistically indistinguishable" from true positives, as formalized via $B(\tau^*) = 0.5$. Like AUC, ICAT depends only on the ordering of scores, so it is invariant to strictly monotonic rescaling. Unlike AUC, it is robust to class imbalance, since the balancing property absorbs the effect of "trivial negatives". In experimental settings with varying overlap between positive and negative distributions, ICAT reflects actual discriminative difficulty rather than being artificially inflated by class distribution (Sumpter, 2023).
5. Comparison to Related Metrics
| Metric | Domain | Core Principle |
|---|---|---|
| CAT | Bias eval (CrowS-Pairs) | Biased preference rate, ignores irrelevance & fluency |
| iCAT | Bias eval (StereoSet, LIBRA) | Combines language ability (lms) and neutrality (ss) |
| EiCAT | Bias eval (LIBRA) | Incorporates JSD divergence and local-word knowledge penalty (bbs) |
| AUC | Classification | Probability positive ranked above negative, threshold independent |
| F1 | Classification | Harmonic mean of precision/recall, ad hoc threshold |
| ICAT | Classification | Precision at indistinguishability threshold (B=0.5), robust to label imbalance |
In LLM bias evaluation, iCAT subsumes ss (CAT Score) and penalizes lack of fluency, while EiCAT (from LIBRA) further incorporates Jensen–Shannon divergence and a "beyond knowledge boundary score" (bbs) to address contexts in which unfamiliar terms impede meaningful bias measurement (Pang et al., 2 Feb 2025). In classification, ICAT avoids artifacts affecting AUC and F1 by anchoring threshold choice to empirical indistinguishability.
6. Illustrative Examples
LLM Bias Example
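Given $N$ triplets, suppose:
- lms = 75 (i.e., the model prefers a meaningful option in 3 of every 4 triplets)
- ss = 50 (equal preference for stereotype and anti-stereotype)
Then: $$\mathrm{iCAT} = 75 \cdot \frac{\min(50,\, 50)}{50} = 75$$ If a model is maximally fluent but completely biased (lms = 100, ss = 100): $$\mathrm{iCAT} = 100 \cdot \frac{\min(100,\, 0)}{50} = 0$$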
Classification ICAT Example (from Sumpter, 2023)
In artificial datasets with varying overlap between the normal score distributions of positives and negatives:
- "Easy" regime: ICAT ≈ 0.85
- "Moderate": ICAT ≈ 0.69
- "Hard": ICAT ≈ 0.50
These reflect intrinsic difficulty and remain stable under label-imbalance manipulations.
7. Strengths, Limitations, and Extensions
Strengths:
- iCAT: Integrates fairness and language discrimination; symmetric; transparent computation; penalizes extreme preference or lack of competence (Pang et al., 2 Feb 2025).
- ICAT: Robust to class imbalance and "easy negatives"; anchored threshold yields direct interpretability; immune to pitfalls of AUC/F1 (Sumpter, 2023).
Limitations:
- iCAT: Reduces full distributional information to two statistics (ss, lms); may miss preference strength nuance; requires triplet format with a crafted irrelevant case; all test cases equally weighted, lacking stereotype severity adaptation.
- ICAT: Focuses on the single indistinguishability point, does not capture full sensitivity/recall tradeoff.
Extensions:
- EiCAT: Combines iCAT’s lms with JSD-based divergence and bbs to measure local context comprehension and bias distributionally.
- For ICAT, the indistinguishability criterion $B = 0.5$ may be replaced by $B = b$ with a target value $b \neq 0.5$ for other balances between positive and predicted classes, generalizing the notion of controlled tradeoff.
A plausible implication is that both ICAT formulations serve as templates for single-number metrics that are robust to frequent artifacts affecting more commonly used measures, provided construction and application align with their rigorous criteria.