Generative AI Susceptibility Index (GAISI)
- Generative AI Susceptibility Index (GAISI) is a metric assessing vulnerability to generative AI, defined differently for educational assessments and labour market analyses.
- In higher education, GAISI combines eight static-analysis components with a dynamic testing score to evaluate assessment security against AI exploitation.
- In the UK labour market, GAISI measures the share of job tasks likely to achieve at least 25% time-savings via LLMs, focusing on marginal productivity gains.
“Generative AI Susceptibility Index” (GAISI) is not a single canonical metric but a name used in 2025 for at least two distinct quantitative constructs. In higher education assessment, GAISI denotes a scalar index of assessment vulnerability to generative AI, defined by combining eight static-analysis components with a dynamic-testing score in a weighted linear model (Torkestani et al., 31 May 2025). In labour-market analysis, GAISI denotes a task-based measure of UK job exposure to LLMs, defined as the share of job activities where an LLM or LLM-powered system is judged likely to reduce task completion time by at least 25 per cent beyond existing productivity tools (Henseke et al., 30 Jul 2025). The shared acronym reflects a common concern with susceptibility or exposure to generative AI capabilities, but the underlying objects, data-generating processes, and interpretive targets differ substantially.
1. Nomenclature and domain-specific meaning
The higher-education formulation emerges from a “machine-versus-machine” response to the assessment problems created by systems such as GPT-4, Claude, and Llama. Its motivation is explicitly institutional: traditional assessment methods are described as facing an existential threat, with surveys indicating 74–92% of students experimenting with these tools for academic purposes. The framework is designed to evaluate how vulnerable a given assessment is to generative AI by combining pattern-based inspection with simulation-based testing (Torkestani et al., 31 May 2025).
The labour-market formulation is explicitly task-based and UK-specific. It is intended to capture first-order effects of generative AI on knowledge work by measuring marginal gains in existing tasks rather than wholesale automation. Its score lies on the interval $[0,1]$ and represents the share of a job’s tasks for which an LLM, alone or integrated into software, is likely to save at least 25% of completion time relative to incumbent productivity tools (Henseke et al., 30 Jul 2025).
| GAISI usage | Unit of analysis | Score meaning |
|---|---|---|
| Higher-education assessment GAISI | An assessment | Vulnerability to generative AI across static and dynamic dimensions |
| UK labour-market GAISI | An individual job | Share of tasks exposed to LLM-enabled time savings |
A plausible implication is that “GAISI” should be interpreted only in conjunction with its domain. Without that qualifier, the acronym is ambiguous.
2. Higher-education assessment GAISI: formal structure
In the assessment framework, GAISI is defined as a single scalar combining eight static-analysis sub-scores $S_1, \dots, S_8$ and one dynamic-testing score $D$ in a weighted linear model:

$$\mathrm{GAISI} = \sum_{i=1}^{8} w_i S_i + w_D D$$

Here, each $S_i$ is a normalized static-analysis score, $D$ is the normalized dynamic-testing score, and the weights $w_i$ and $w_D$ are determined under the paper’s weighting framework (Torkestani et al., 31 May 2025).
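The weighted linear model described above can be sketched in a few lines. This is a minimal illustration, assuming all scores are normalized to [0, 1] and that the nine weights sum to 1; the function name `gaisi_assessment`, the component labels, and the equal weights in the usage note are hypothetical and do not come from the paper.

```python
from typing import Sequence

# Labels for the eight static-analysis components named in the text.
STATIC_COMPONENTS = (
    "specificity_contextualization",
    "temporal_relevance",
    "process_visibility",
    "personalization",
    "resource_accessibility",
    "multimodal_integration",
    "ethical_reasoning",
    "collaborative_elements",
)


def gaisi_assessment(
    static_scores: Sequence[float],
    dynamic_score: float,
    static_weights: Sequence[float],
    dynamic_weight: float,
) -> float:
    """Weighted linear combination of eight static scores and one dynamic score."""
    if len(static_scores) != len(STATIC_COMPONENTS):
        raise ValueError("expected eight static scores")
    if len(static_weights) != len(STATIC_COMPONENTS):
        raise ValueError("expected eight static weights")
    for value in (*static_scores, dynamic_score):
        if not 0.0 <= value <= 1.0:
            raise ValueError("scores are assumed to be normalized to [0, 1]")
    # Assumption (not stated in the text): weights are normalized to sum to 1.
    if abs(sum(static_weights) + dynamic_weight - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1")
    static_part = sum(w * s for w, s in zip(static_weights, static_scores))
    return static_part + dynamic_weight * dynamic_score
```

With equal illustrative weights of 1/9 and every score set to 0.5, the function returns 0.5. In practice the weights would come from one of the paper's weighting routes rather than being equal.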
The eight static-analysis components are specificity and contextualization, temporal relevance, process visibility requirements, personalization elements, resource accessibility, multimodal integration, ethical reasoning requirements, and collaborative elements. Each is justified as exploiting a limitation in contemporary generative AI. Specificity and contextualization are motivated by the claim that LLMs rely on high-frequency training patterns and struggle with unique, course-specific contexts. Temporal relevance targets model knowledge cutoffs. Process visibility requirements target the gap between polished output and authentic iterative thinking. Personalization elements exploit the absence of genuine personal experience. Resource accessibility targets the inability to fetch non-public or program-specific materials. Multimodal integration targets cross-modal synthesis difficulties. Ethical reasoning requirements distinguish recitation of frameworks from genuine moral agency. Collaborative elements target the inability to negotiate or adapt in real time with human partners (Torkestani et al., 31 May 2025).
Several of these components are given explicit sub-models. Specificity and contextualization is scored from three rubric-based parts: topical, contextual, and analytical specificity. Process visibility requirements are scored from the number of distinct process artifacts required relative to the maximum possible. Personalization elements and multimodal integration each have their own sub-model, and for ethical reasoning requirements and collaborative elements the paper specifies analogous weighted decompositions.
Temporal relevance is measured using weighted ratios of post-cutoff references and recent events or cases, with weights summing to 1. Resource accessibility is measured using weighted ratios of exclusive resources and course-only documents, with an optional term for format diversity. The framework therefore mixes rubric-derived judgment, normalized counts, and weighted compositional scoring rather than relying on a single rubric template (Torkestani et al., 31 May 2025).
3. Dynamic testing, calibration, and operational workflow in assessment
The dynamic-testing component complements the static profile through simulation-based vulnerability assessment. The prescribed approach is to “red-team” an assessment by having several LLMs, including GPT-4, Claude, and Llama, attempt it under realistic student-AI workflows such as iterative prompting, retrieval-augmented generation, and chain-of-thought. This is intended to address limitations in purely pattern-based analysis (Torkestani et al., 31 May 2025).
The dynamic score is defined over four normalized metrics: accuracy, prompt iterations, coverage, and robustness.
Accuracy is defined as match against expert rubric; prompt iterations are the inverse of the number of prompts needed to reach acceptable quality; coverage is the proportion of rubric criteria addressed; robustness is the performance drop under adversarial prompts. This formulation makes the assessment GAISI partly an adversarial benchmarking instrument rather than only a structural design checklist (Torkestani et al., 31 May 2025).
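The four metric definitions above translate directly into code. The sketch below computes each metric from one red-team run and then combines them; the combination rule (an equal-weight average by default) is an assumption made for illustration, since the text here does not state how the paper aggregates the four metrics. The `DynamicRun` record, its field names, and the clamping of the robustness drop at zero are likewise hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class DynamicRun:
    rubric_match: float              # LLM answer's match against the expert rubric, in [0, 1]
    prompts_needed: int              # prompts needed to reach acceptable quality (>= 1)
    criteria_addressed: int          # rubric criteria the answer addresses
    criteria_total: int              # rubric criteria in total (>= 1)
    adversarial_rubric_match: float  # rubric match under adversarial prompts, in [0, 1]


def dynamic_metrics(run: DynamicRun) -> Dict[str, float]:
    """Compute the four normalized metrics as defined in the text."""
    if run.prompts_needed < 1 or run.criteria_total < 1:
        raise ValueError("prompts_needed and criteria_total must be at least 1")
    return {
        "accuracy": run.rubric_match,
        "prompt_iterations": 1.0 / run.prompts_needed,
        "coverage": run.criteria_addressed / run.criteria_total,
        # Performance drop under adversarial prompts; clamped at 0 (assumption).
        "robustness": max(0.0, run.rubric_match - run.adversarial_rubric_match),
    }


def dynamic_score(
    metrics: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Combine the metrics; equal weights are a placeholder, not the paper's rule."""
    if weights is None:
        weights = {name: 1.0 / len(metrics) for name in metrics}
    return sum(weights[name] * value for name, value in metrics.items())
```

For example, a run with a rubric match of 0.8, two prompts, three of four criteria covered, and an adversarial rubric match of 0.6 gives metrics of 0.8, 0.5, 0.75, and about 0.2.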
Weight selection is handled through three routes: empirical weighting, expert-judgment weighting, and concept-derivation. Empirical weighting regresses static scores against observed LLM success rates over a corpus of assessments. Expert-judgment weighting uses Delphi panels of assessment designers and AI experts to assign weights. Concept-derivation bases weights on fundamental LLM limitations, such as assigning a high weight to temporal relevance while models lack recency. Threshold determination is expressed through a traffic-light system that uses two threshold values to divide GAISI scores into three bands.
The thresholds may be set through empirical breakpoints in LLM performance curves, normative risk-tolerance, or pragmatic resource constraints. Figure 1 is specified as a radar-chart visualization for static profiles (Torkestani et al., 31 May 2025).
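A two-threshold banding of this kind can be sketched as below. The function name, the band labels, and the treatment of scores that fall exactly on a threshold are hypothetical; which band corresponds to which traffic-light colour is deliberately left open, because the text here does not state it.

```python
def traffic_light_band(score: float, lower: float, upper: float) -> str:
    """Place a GAISI score into one of three bands defined by two thresholds."""
    if not lower < upper:
        raise ValueError("lower threshold must be strictly below upper threshold")
    if score < lower:
        return "below_lower"
    if score < upper:
        # Assumption: a score equal to the lower threshold falls in the middle band.
        return "between"
    return "above_upper"
```

An institution would map the three labels to colours and set `lower` and `upper` through one of the routes described above, such as empirical breakpoints or normative risk tolerance.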
The operational workflow is seven-step: select assessments; score the eight static elements; plot the radar chart; execute adversarial LLM runs; compute the dynamic-testing score; compute GAISI; compare against thresholds; redesign the highest-impact elements and repeat the cycle over time. Best practices include involving cross-functional teams, calibrating rubrics and weights via pilot studies and local data, balancing security with equity, regularly updating thresholds and weights, and documenting process visibility and data for institutional audit. The paper also states explicit limitations: dynamic tests are resource-intensive, over-specific rubrics can reduce transferability of learning, thresholds and weights may require frequent recalibration as LLMs advance, personalization and resource constraints may disadvantage some student groups, and GAISI should not be mistaken for a definitive guarantee of security (Torkestani et al., 31 May 2025).
4. UK labour-market GAISI: probabilistic task construction
In the UK labour-market application, GAISI is built in two stages: LLM-driven probabilistic ratings at the occupation-task cell and aggregation to each worker’s job via importance weights. The underlying task space consists of 44 generic work-activity items, grouped into 12 categories, crossed with 25 UK SOC 2010 sub-major occupations. For each occupation-task cell $(o, t)$, an LLM is prompted five times to return a probability distribution over four exposure levels $e$: no meaningful exposure, direct LLM exposure, latent exposure, and multi-modal exposure (Henseke et al., 30 Jul 2025).
The model-average probability is the mean over the five runs:

$$\bar{p}_{ot}(e) = \frac{1}{5}\sum_{r=1}^{5} p^{(r)}_{ot}(e)$$

where $p^{(r)}_{ot}(e)$ is the probability assigned to exposure level $e$ in run $r$.
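The averaging step for one occupation-task cell can be sketched as follows. The level labels and the function name `average_runs` are hypothetical, and the check that each run's probabilities sum to 1 is an added safeguard rather than something the text prescribes.

```python
from typing import Dict, List

# Hypothetical labels for the four exposure levels named in the text.
LEVELS = ("none", "direct", "latent", "multimodal")


def average_runs(runs: List[Dict[str, float]]) -> Dict[str, float]:
    """Average per-run probability distributions for one occupation-task cell."""
    if not runs:
        raise ValueError("at least one run is required")
    for run in runs:
        if abs(sum(run[level] for level in LEVELS) - 1.0) > 1e-6:
            raise ValueError("each run must be a probability distribution")
    return {level: sum(run[level] for run in runs) / len(runs) for level in LEVELS}
```

Because each run sums to 1, the averaged distribution also sums to 1, so fractional exposure and run-to-run uncertainty carry through to the next stage.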
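Worker-level aggregation uses the Skills & Employment Survey (SES) importance scale, whose five response options (essential, very important, fairly important, not very important, and not at all important) are mapped to numeric importance weights. If respondent $i$ reports task importance $w_{it}$ for task $t$ and has total importance-equivalent task load $W_i = \sum_t w_{it}$, then the importance-weighted share of tasks in exposure category $e$ is

$$s_i(e) = \frac{1}{W_i}\sum_{t} w_{it}\,\bar{p}_{o(i)t}(e)$$

where $\bar{p}_{o(i)t}(e)$ is the model-average probability of category $e$ for task $t$ in respondent $i$'s occupation $o(i)$.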
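The final index combines these category shares into a single score, applying a discount weight to latent exposure following Eloundou et al. (2024). By construction, the category shares sum to one and the index lies in $[0,1]$ (Henseke et al., 30 Jul 2025).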
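The worker-level aggregation can be sketched as below. Everything numeric here is a placeholder: the text does not give the importance values attached to the SES responses, the size of the discount weight, or how the multi-modal category enters the index, so the caller supplies `importance` and `category_weights`. The function names and level labels are hypothetical.

```python
from typing import Dict

# Hypothetical labels for the four exposure levels named in the text.
LEVELS = ("none", "direct", "latent", "multimodal")


def exposure_shares(
    importance: Dict[str, float],
    cell_probs: Dict[str, Dict[str, float]],
) -> Dict[str, float]:
    """Importance-weighted share of a worker's task load in each exposure level.

    importance: task -> numeric importance weight reported by the worker.
    cell_probs: task -> averaged probability distribution for the worker's occupation.
    """
    total_load = sum(importance.values())
    if total_load <= 0:
        raise ValueError("total importance-equivalent task load must be positive")
    return {
        level: sum(importance[t] * cell_probs[t][level] for t in importance) / total_load
        for level in LEVELS
    }


def gaisi_worker(shares: Dict[str, float], category_weights: Dict[str, float]) -> float:
    """Collapse the category shares into one index using caller-supplied weights."""
    return sum(category_weights[level] * shares[level] for level in LEVELS)
```

As a toy example, a worker who rates a writing task at importance 1.0 and a lifting task at 0.5, with most of the writing probability on direct exposure and all of the lifting probability on no exposure, gets shares that sum to 1. Giving zero weight to the unexposed category and weights of at most 1 to the others keeps the resulting index between 0 and 1.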
The exposure criterion is explicitly a 25% time-saving threshold relative to existing productivity tools, rather than Eloundou et al.’s 50% rule. The raters are instructed to consider task function, typical workflows in the occupational vignette, existing tools such as word-processors and databases, and the incremental capabilities of LLMs alone or when embedded in applications. The use of full probability distributions over the exposure levels is intended to capture fractional exposure and uncertainty rather than a hard yes/no classification (Henseke et al., 30 Jul 2025).
The empirical architecture is built on the SES 2023–24, a representative cross-section of 5,784 British workers aged 20–65, together with task vignettes for the 25 occupational groups, monthly vacancy data by 3-digit SOC 2020 × local authority from the Office for National Statistics for January 2017 to May 2025, and the OECD Survey of Adult Skills (PIAAC) 2023 as an external benchmark. Google Gemini 1.5 Pro is the main rater; extensions use GPT-4o and Gemini 2.5 Pro. Prompts follow a chain-of-thought structure, include few-shot examples, hold temperature at 0.2, and guide the LLM to justify each probability (Henseke et al., 30 Jul 2025).
5. Reliability, validity, and comparative performance
The labour-market GAISI is evaluated using Messick’s six-fold framework: content, substantive, structural, external, generalisability, and consequential validity. Structural reliability is reported using two-way random-effects intraclass correlation coefficients for absolute agreement across five stochastic runs. Both single-rating ICC(A,1) and average-rating ICC(A,5) values are reported for three rated quantities, and the paper interprets them as near-perfect agreement (Henseke et al., 30 Jul 2025).
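For readers unfamiliar with these coefficients, the sketch below computes the single-rating and average-rating ICC for absolute agreement under a two-way random-effects model, using the standard mean-squares formulas. It is a generic illustration, not the paper's code: the function name is hypothetical, rows stand for rated cells, and columns stand for repeated runs.

```python
from typing import List, Tuple


def icc_absolute_agreement(ratings: List[List[float]]) -> Tuple[float, float]:
    """Return (ICC(A,1), ICC(A,k)) for an n-by-k table of ratings.

    Rows are rated targets (e.g. occupation-task cells); columns are raters or runs.
    Row means must vary, otherwise the coefficients are undefined.
    """
    n = len(ratings)
    k = len(ratings[0]) if n else 0
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a rectangular table with at least 2 rows and 2 columns")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Mean squares for rows (targets), columns (raters), and residual error.
    ms_rows = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_cols = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )
    ms_err = ss_err / ((n - 1) * (k - 1))

    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    )
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)
    return single, average
```

The average-rating coefficient is never lower than the single-rating one when agreement is positive, which is why averaging five runs yields a more reliable rating than any single run.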
Content validity is supported in two ways. First, a heat-map of dominant exposure classes across the 12 task categories shows non-exposed anchors such as manual tasks and exposed anchors such as writing long documents. Of the 44 tasks, 30 fall predominantly into one exposure class, 8 into a second, and 6 into a third. Second, the SES task battery explains on average 30% of working hours, measured by the $R^2$ from regressing weekly hours on task load, ranging from 13% for machine operators to 50% for managers. The paper interprets this as evidence that the battery covers a substantial and relevant share of job content (Henseke et al., 30 Jul 2025).
Substantive validity is tested by coding 5,500 LLM justifications for affordances, integration conditions, and human constraints. In a regression of task-level scores on binary tags, each additional affordance tag raises GAISI, with a squared term capturing diminishing returns; integration cues add to the score; and human-constraint mentions and hedging language each reduce it. External validity is assessed through Spearman correlations with established exposure measures: Felten et al.’s general AI exposure, FRS LLM-specific exposure, Webb’s AI patents index, and Brynjolfsson et al.’s machine-learning suitability measure. Negative correlations with Webb’s robotics index and Frey and Osborne’s automation index are presented as evidence of discriminant validity (Henseke et al., 30 Jul 2025).
Predictive validity is assessed using self-reported AI use in SES 2023–24. A one-standard-deviation increase in GAISI is associated with higher reported AI use, expressed as an average marginal effect in percentage points and controlling for demographics, and the paper states that the area under the ROC curve exceeds its benchmark for useful discrimination. In joint models with FRS, Webb, BMR, and FO measures, GAISI remains highly significant and outperforms a binary EMMR rubric on AUC. Robustness checks vary the latent-exposure weight, change prompts, switch rater models, or omit few-shot examples; the paper reports the resulting ranges of mean GAISI, Spearman correlations with the benchmark, and AUC values. Portability to the OECD PIAAC 2023 yields country-occupation GAISI scores with similar distributions. Consequential-validity checks show negligible associations between residualized worker-level GAISI and sex, age, ethnicity, and education (Henseke et al., 30 Jul 2025).
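The area under the ROC curve has a simple rank interpretation: it is the probability that a randomly chosen AI user has a higher score than a randomly chosen non-user, with ties counted as half. The sketch below implements that generic definition on raw scores; it is not the paper's estimation procedure, which may use model-based predictions with controls, and the function name is hypothetical.

```python
from typing import Sequence


def auc_from_scores(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: P(score of a positive > score of a negative), ties count 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

A value of 0.5 corresponds to no discrimination and 1.0 to perfect separation of users from non-users, which is the scale on which the paper's benchmark comparison is made.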
6. Empirical findings, interpretation, and recurrent misconceptions
The labour-market GAISI is used to trace changes from 2017 to 2023–24 and to link exposure to wages and vacancies. Mean GAISI rises over the period, and the decomposition attributes virtually all growth to employment shifts toward more susceptible occupations rather than within-occupation change. Occupations with higher 2017 GAISI exhibit faster employment growth to 2023–24 (Henseke et al., 30 Jul 2025).
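A between/within split of this kind can be illustrated with a standard shift-share decomposition. The sketch below uses midpoint weights, which make the two terms add up exactly to the change in the mean; this particular form is an assumption, since the text does not say which decomposition the paper uses, and the function name and toy inputs are hypothetical.

```python
from typing import Dict, Tuple


def shift_share(
    shares_start: Dict[str, float],
    shares_end: Dict[str, float],
    gaisi_start: Dict[str, float],
    gaisi_end: Dict[str, float],
) -> Tuple[float, float]:
    """Split the change in mean GAISI into (between, within) occupation terms.

    between: change in employment shares, valued at each occupation's average GAISI.
    within:  change in occupation GAISI, valued at each occupation's average share.
    """
    between = 0.0
    within = 0.0
    for occ in shares_start:
        mean_gaisi = (gaisi_start[occ] + gaisi_end[occ]) / 2
        mean_share = (shares_start[occ] + shares_end[occ]) / 2
        between += (shares_end[occ] - shares_start[occ]) * mean_gaisi
        within += mean_share * (gaisi_end[occ] - gaisi_start[occ])
    return between, within
```

If employment moves toward a high-GAISI occupation while occupation-level scores stay fixed, the within term is zero and the entire change in the mean is attributed to the between term, which mirrors the pattern the paper reports.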
Within-occupation wage regressions report a log-hourly-pay premium per one-standard-deviation increase in GAISI in both 2017 and 2023–24, significant at conventional levels in each year. A difference-in-differences estimate indicates that the premium for high-GAISI jobs fell relative to low-GAISI jobs. Vacancy analysis using a panel of postings from January 2017 to May 2025 reports a stable positive vacancy-exposure relationship before GPT release, but from 2022 Q4 to 2025 Q2 higher GAISI is associated with fewer postings: 7.6% fewer per 0.1 GAISI, which the paper also expresses as a 23% reduction for a larger rise in the index. For an interquartile-range shift, high-exposure occupations are estimated to have posted 74,000 fewer ads by May 2025 than pre-GPT trends implied, approximately a 6.5% shortfall (Henseke et al., 30 Jul 2025).
The paper’s interpretation is that displacement effects may already outweigh productivity gains in the short to medium term. At the same time, it reports that the average exposure level is moderate and that only 13% of jobs exceed its stated exposure threshold. This combination matters for interpretation: the index is not presented as evidence of universal automation, but as a granular measure of exposure to LLM-enabled time savings (Henseke et al., 30 Jul 2025).
A recurrent misconception is to treat both GAISI formulations as if they measured the same phenomenon. They do not. The assessment GAISI measures vulnerability of assessment designs to generative AI-assisted completion and explicitly warns against false precision, stating that GAISI is an index rather than a definitive guarantee of security (Torkestani et al., 31 May 2025). The labour-market GAISI measures exposure of job tasks to at least 25% time savings beyond existing tools and explicitly focuses on marginal gains in existing tasks rather than wholesale automation (Henseke et al., 30 Jul 2025). This suggests that the common acronym denotes a broader family of susceptibility metrics whose substantive meaning is fixed by the workflow, thresholds, and unit of analysis chosen in each domain.