Generative AI Literacy Assessment Test

Updated 5 February 2026
  • Generative AI Literacy Assessment Test (GLAT) is a standardized, psychometrically validated tool designed to objectively measure core competencies such as prompt engineering, ethics, and output evaluation.
  • It employs a diverse item set including multiple-choice questions, open-ended tasks, and practical exercises generated through both expert review and automated LLM-based methods.
  • Empirical validation using classical test theory and IRT confirms GLAT's robust reliability and validity, supporting its application in educational benchmarking and workforce upskilling.

The Generative AI Literacy Assessment Test (GLAT) is a performance-based, psychometrically validated instrument designed to objectively measure competencies essential for effective interaction, evaluation, and responsible application of generative artificial intelligence systems. GLAT is informed by multidimensional competency frameworks, psychometric models, and automated question generation with LLMs, providing a standardized approach for educational, workforce, and research settings to assess generative AI literacy (Jin et al., 2024, Annapureddy et al., 2024, Markus et al., 17 Mar 2025, Wang et al., 2024, Li et al., 16 Mar 2025, Maity et al., 2024, Hwang et al., 2023).

1. Conceptual Foundations and Competency Models

GLAT is grounded in contemporary definitions of AI literacy, specifically as "the ability to critically evaluate, effectively interact with, and meaningfully use AI technologies across multiple domains of life" (Markus et al., 17 Mar 2025). For generative AI, the competency model is expanded to comprise twelve core domains, encompassing basic AI knowledge, prompt engineering, model adaptation, ethical/legal considerations, detection and evaluation of AI-generated outputs, and continuous learning (Annapureddy et al., 2024).

Competency domains may be formalized as:

  • Understanding AI: Concepts of ML/DL, distinctions from classical programming.
  • Generative Model Knowledge: Architectures (GANs, VAEs, transformers), training paradigms.
  • Tool Use & Prompting: GUI navigation, parameter tuning, use of system/few-shot/chain-of-thought prompts.
  • Capacity and Limitation Awareness: Recognition of hallucinations, bias, security.
  • Output Evaluation: Fact-checking, relevance, bias identification.
  • Programming/Fine-Tuning: Scripting, model metric calculation (e.g., perplexity).
  • Ethics/Legal: Transparency, accountability, legislative frameworks.
  • Application Contexts and Continuous Learning.

2. Test Design: Item Types and Automated Generation

GLAT takes a pluralistic approach to item construction, combining rigorous manual item development with automated, scalable LLM-based generation. Item formats include:

  • Multiple-choice questions (MCQ): Each item targets a specific competency, with keys and distractors vetted for technical accuracy and plausibility using both expert review and critique-agent workflows (Wang et al., 2024, Jin et al., 2024).
  • Open-ended performance tasks: Elicit written responses, scored via rubrics to assess communication, creativity, stepwise collaboration, and evaluative acumen (Li et al., 16 Mar 2025, Hwang et al., 2023).
  • Practical/lab exercises: Tool navigation, programming tasks, detector tool use (Annapureddy et al., 2024).
  • Scenario-based questions: Require application of generative AI concepts to domain-specific problems (Maity et al., 2024).

Automated MCQ generation with LLMs incorporates multi-agent architectures: generator agents instantiate items aligned to user-specified learning objectives, Bloom’s Taxonomy level, and grade; critique agents screen for linguistic clarity and item-writing flaws; a supervisor aggregates flaw counts and routes iterative revision (Wang et al., 2024). This process is formalized algorithmically to ensure scalable, context-appropriate, pedagogically valid item banks.
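A minimal sketch of this generator–critique–supervisor loop is given below. The agent functions are stubs standing in for prompted LLM calls, and all names (MCQItem, build_item, etc.) are illustrative rather than part of any published implementation.

```python
# Sketch of a generator-critique-supervisor loop for automated MCQ creation.
# Agent functions are stubs standing in for prompted LLM calls; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class MCQItem:
    stem: str
    options: list[str]
    key: int                       # index of the correct option
    objective: str                 # learning objective the item targets
    bloom_level: str               # e.g. "Apply", "Analyze"
    flaws: list[str] = field(default_factory=list)

def generate_item(objective: str, bloom_level: str, grade: str) -> MCQItem:
    """Generator agent: draft an item aligned to objective, Bloom level, and grade (stub)."""
    return MCQItem(stem=f"[draft stem targeting: {objective}]",
                   options=["option A", "option B", "option C", "option D"], key=0,
                   objective=objective, bloom_level=bloom_level)

def critique_item(item: MCQItem) -> list[str]:
    """Critique agent: return detected item-writing flaws (stub; empty list means the item passes)."""
    return []

def revise_item(item: MCQItem, flaws: list[str]) -> MCQItem:
    """Generator agent revises the draft in response to critique (stub)."""
    item.flaws = []
    return item

def build_item(objective: str, bloom_level: str, grade: str, max_rounds: int = 3) -> MCQItem:
    """Supervisor: aggregate flaws and route iterative revision until the item is clean or rounds run out."""
    item = generate_item(objective, bloom_level, grade)
    for _ in range(max_rounds):
        flaws = critique_item(item)
        if not flaws:
            break
        item = revise_item(item, flaws)
    return item

item = build_item("recognise hallucinations in LLM output", "Analyze", grade="undergraduate")
```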

3. Psychometric Validation and Scoring

GLAT employs classical test theory (CTT) and item response theory (IRT) for structural validation:

  • Internal consistency is measured via Cronbach’s alpha and McDonald’s omega (e.g., $\alpha = 0.80$, $\omega_\text{total} = 0.81$ for a 20-item GLAT), with thresholds for psychometric adequacy set at $\alpha, \omega \geq 0.70$ (Jin et al., 2024, Li et al., 16 Mar 2025, Markus et al., 17 Mar 2025).
  • Construct validity: Factor analyses (exploratory, confirmatory) confirm unidimensional or multidimensional models, with fit indices RMSEA $< 0.05$ and CFI $> 0.95$ considered exemplary (e.g., RMSEA = 0.03; CFI = 0.97) (Jin et al., 2024, Markus et al., 17 Mar 2025, Li et al., 16 Mar 2025).
  • IRT Calibration: 2PL models estimate discrimination ($a_i$) and difficulty ($b_i$) parameters for MCQs:

$$P(X_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\bigl(-a_i(\theta_j - b_i)\bigr)}$$

Items with $b \notin [-3, 3]$, $a < 0.35$, or $a > 2.5$ may be excluded unless required for domain coverage (Markus et al., 17 Mar 2025); a computational sketch of these quantities follows this list.

  • Rubric scoring for open tasks: Each criterion (clarity, specificity, novelty, relevance, iterative decomposition) is rated on a standardized scale (e.g., 1–5 or 0–3 points per subskill) (Li et al., 16 Mar 2025, Hwang et al., 2023, Annapureddy et al., 2024). Composite scores are either summed or standardized and benchmarked into proficiency bands (e.g., z-score thresholds) for interpretive reporting.
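The reliability and 2PL quantities above can be sketched computationally as follows; the simulated response matrix and parameter values are placeholders for illustration, not GLAT calibration results.

```python
# Sketch of Cronbach's alpha, the 2PL response probability, and the item-retention rule.
# Data and parameter values are simulated placeholders, not GLAT calibration results.
import numpy as np

def cronbach_alpha(responses: np.ndarray) -> float:
    """Cronbach's alpha for a (n_examinees, n_items) matrix of scored answers (1 = correct)."""
    k = responses.shape[1]
    item_var_sum = responses.var(axis=0, ddof=1).sum()
    total_var = responses.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)

def p_correct_2pl(theta: float, a: float, b: float) -> float:
    """2PL model: probability that an examinee of ability theta answers item (a, b) correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def retain_item(a: float, b: float) -> bool:
    """Retention rule from the calibration guidelines: drop items with extreme parameters."""
    return (-3.0 <= b <= 3.0) and (0.35 <= a <= 2.5)

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 20))            # 200 simulated examinees x 20 items
print(f"alpha = {cronbach_alpha(X):.2f}")
print(f"P(correct | theta=0.5, a=1.2, b=-0.3) = {p_correct_2pl(0.5, 1.2, -0.3):.2f}")
print(retain_item(a=0.2, b=1.0))                  # False: discrimination below 0.35
```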

4. Domains, Subdomains, and Sample Assessment Strategies

The twelve-defining-competency model for generative AI literacy guides both item content and assessment structure (Annapureddy et al., 2024):

| Competency Domain | Representative Task | Item Format / Rubric Criteria |
| --- | --- | --- |
| Prompt Engineering | Write a chain-of-thought solution | Open response, scored on clarity, structure, instructions |
| Output Evaluation | Spot factual errors, state bias | MCQ, short answer, scored on accuracy/relevance |
| Programming & Fine-Tuning | Script GPT-2 tuning exercise | Code lab, metric calculation, pass/fail + rubric |
| Detect AI-generated Content | Classify 10 snippets as AI/human | Binary classification, clue justification |
| Ethics, Legal, Contexts | Critique deepfake scenario | Case study, policy memo, scored on principle application |
| Continuous Learning | Learning journal, goals worksheet | Reflective log, SMART goal rubric |

This assessment strategy is designed for technical audiences: MCQs probe foundational and in-depth knowledge (architectures, limitations, ethical risk), performance tasks assess real-world tool use, and open-ended items emphasize evaluative and generative creativity.
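The table can equivalently be carried as a machine-readable test blueprint. The sketch below follows the table's rows; the item counts and weights are hypothetical placeholders rather than published GLAT specifications.

```python
# Illustrative blueprint mirroring the table above; n_items and weight are hypothetical.
from dataclasses import dataclass

@dataclass
class BlueprintRow:
    domain: str
    representative_task: str
    item_format: str
    n_items: int          # hypothetical allocation per domain
    weight: float         # hypothetical share of the composite score

BLUEPRINT = [
    BlueprintRow("Prompt Engineering", "Write a chain-of-thought solution",
                 "open response, rubric-scored", 3, 0.20),
    BlueprintRow("Output Evaluation", "Spot factual errors, state bias",
                 "MCQ / short answer", 5, 0.20),
    BlueprintRow("Programming & Fine-Tuning", "Script GPT-2 tuning exercise",
                 "code lab, pass/fail + rubric", 2, 0.15),
    BlueprintRow("Detect AI-generated Content", "Classify 10 snippets as AI/human",
                 "binary classification + justification", 4, 0.15),
    BlueprintRow("Ethics, Legal, Contexts", "Critique deepfake scenario",
                 "case study / policy memo", 3, 0.20),
    BlueprintRow("Continuous Learning", "Learning journal, goals worksheet",
                 "reflective log, SMART goal rubric", 2, 0.10),
]
assert abs(sum(row.weight for row in BLUEPRINT) - 1.0) < 1e-9   # weights sum to 1
```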

5. Administration, Scoring, and Adaptive Mechanisms

GLAT administration typically occurs via a web-based, timed interface, with 18–20 items spanning the core domains. Proctoring may include screen monitoring or honor codes. Automated scoring for MCQs is immediate; open-ended tasks are scored with rubric scripts (LLM-augmented where feasible), plus human spot checks for edge cases (Li et al., 16 Mar 2025, Jin et al., 2024).
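A minimal sketch of that scoring flow is shown below, assuming a hypothetical rubric_score function standing in for an LLM-augmented scorer and an illustrative confidence threshold for routing edge cases to human review.

```python
# Sketch of the scoring flow: MCQs keyed automatically, open-ended responses rubric-scored
# (LLM-augmented in practice; stubbed here), low-confidence cases flagged for human review.
def score_mcq(response: int, key: int) -> int:
    """Immediate automated scoring: 1 if the selected option matches the key."""
    return int(response == key)

def rubric_score(text: str, criteria: list[str]) -> tuple[float, float]:
    """Stub for an LLM-augmented rubric scorer: returns (score, confidence)."""
    return 2.5, 0.6    # placeholder values

def score_open_task(text: str, criteria: list[str], review_threshold: float = 0.7) -> dict:
    score, confidence = rubric_score(text, criteria)
    return {"score": score,
            "needs_human_review": confidence < review_threshold}   # spot-check edge cases

print(score_mcq(response=2, key=2))
print(score_open_task("sample response", ["clarity", "specificity", "relevance"]))
```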

Difficulty adaptation is supported in frameworks utilizing automated question generation: algorithms escalate the cognitive level if scores exceed proficiency thresholds (e.g., $\geq 80\%$), and provide scaffolded supports for scores below minimum cut-offs (Maity et al., 2024).
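A simple sketch of such an adaptation rule follows, under the assumption that difficulty is indexed by Bloom level and that the cut-offs used here (0.80 proficiency, 0.50 minimum) are illustrative.

```python
# Illustrative difficulty-adaptation rule: escalate Bloom level at or above the proficiency
# threshold, keep the level and attach scaffolded supports below the minimum cut-off.
BLOOM_LEVELS = ["Remember", "Understand", "Apply", "Analyze", "Evaluate", "Create"]

def next_difficulty(current_level: str, score: float,
                    proficiency: float = 0.80, minimum: float = 0.50) -> tuple[str, bool]:
    """Return (next Bloom level, whether to attach scaffolded supports)."""
    i = BLOOM_LEVELS.index(current_level)
    if score >= proficiency and i < len(BLOOM_LEVELS) - 1:
        return BLOOM_LEVELS[i + 1], False     # escalate cognitive level
    if score < minimum:
        return current_level, True            # stay at level, add scaffolding
    return current_level, False

print(next_difficulty("Apply", 0.85))   # ('Analyze', False)
print(next_difficulty("Apply", 0.40))   # ('Apply', True)
```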

Formal equations for reliability, validity, and item calibration—such as Cronbach’s alpha, composite reliability, RMSEA, CFI, and 2PL item models—are supplied in implementation guides (Jin et al., 2024, Markus et al., 17 Mar 2025, Li et al., 16 Mar 2025). GLAT offers both short- and long-form options, with short versions selected for maximal item loading and balanced content coverage (Markus et al., 17 Mar 2025).
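For reference, standard textbook forms of two of these statistics are shown below, with standardized loadings $\lambda_i$ and error variances $\theta_i$ for composite reliability, and model $\chi^2$ on $df$ degrees of freedom for $N$ respondents for RMSEA; these are generic definitions, not values or derivations taken from the cited guides.

$$\mathrm{CR} = \frac{\left(\sum_i \lambda_i\right)^2}{\left(\sum_i \lambda_i\right)^2 + \sum_i \theta_i}, \qquad \mathrm{RMSEA} = \sqrt{\frac{\max(\chi^2 - df,\ 0)}{df\,(N-1)}}$$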

6. Empirical Results, Validity, and Impact

Empirical validation studies demonstrate GLAT’s predictive and criterion validity: GLAT scores robustly explain variance in generative-AI-powered task performance, outperforming self-report proxies (e.g., perceived ChatGPT proficiency) (Jin et al., 2024). Regression analyses confirm that latent generative AI literacy (A-factor) predicts performance on complex, language-based creative tasks, accounting for domain specificity and individual differences (Li et al., 16 Mar 2025).

Reported reliability coefficients ($\alpha_{\mathrm{total}} = 0.80$–$0.87$) and model fit (CFI $\approx 0.97$, RMSEA $\approx 0.03$) meet or exceed standard thresholds for educational and psychological instruments (Jin et al., 2024, Markus et al., 17 Mar 2025, Li et al., 16 Mar 2025).

Suggested educational applications include course benchmarking, targeted intervention (prompt tutorials, critical evaluation workshops), curriculum tracking for skill development, policy-level GenAI literacy requirements, and real-time feedback in workforce upskilling programs (Annapureddy et al., 2024, Jin et al., 2024, Markus et al., 17 Mar 2025, Li et al., 16 Mar 2025).

7. Limitations, Future Directions, and Recommendations

Current GLAT implementations focus primarily on MCQs and open-ended tasks, without systematic support for project-based or coding assessments in the core workflow (Wang et al., 2024). Real-world classroom trials and longitudinal studies of impact remain to be conducted. Item banks require continuous review for technical currency due to rapid advances in generative models and associated skills (Markus et al., 17 Mar 2025, Annapureddy et al., 2024).

Recommended future directions:

  • Expansion to adaptive, data-driven difficulty calibration.
  • Integration of backward-design and domain-specific modules.
  • Deployment of crowd-sourced feedback and student performance data to refine item pools in real time.
  • Cross-cultural adaptation and measurement invariance studies.
  • Extension to younger learners, workforce roles, and specialized domains (healthcare, legal, STEM) (Li et al., 16 Mar 2025, Annapureddy et al., 2024).

GLAT is positioned as the field-standard for rigorous, multidimensional, and scalable assessment of generative AI literacy, providing technical benchmarks and interpretive guidance for educational institutions, professional development, and policy-making contexts.
