
Responsible AI for Test Equity and Quality: The Duolingo English Test as a Case Study (2409.07476v1)

Published 28 Aug 2024 in cs.CY, cs.AI, and cs.CL

Abstract: AI creates opportunities for assessments, such as efficiencies for item generation and scoring of spoken and written responses. At the same time, it poses risks (such as bias in AI-generated item content). Responsible AI (RAI) practices aim to mitigate risks associated with AI. This chapter addresses the critical role of RAI practices in achieving test quality (appropriateness of test score inferences), and test equity (fairness to all test takers). To illustrate, the chapter presents a case study using the Duolingo English Test (DET), an AI-powered, high-stakes English language assessment. The chapter discusses the DET RAI standards, their development and their relationship to domain-agnostic RAI principles. Further, it provides examples of specific RAI practices, showing how these practices meaningfully address the ethical principles of validity and reliability, fairness, privacy and security, and transparency and accountability standards to ensure test equity and quality.

Summary

  • The paper presents the Duolingo English Test's Responsible AI standards as a case study for applying RAI practices, aligned with the NIST AI RMF, to enhance test quality and equity in online assessments.
  • The DET's RAI standards address Validity/Reliability (ensuring test suitability and scoring accuracy), Fairness (promoting access, inclusion, and mitigating bias), Privacy/Security (protecting data and preventing cheating), and Accountability/Transparency (building stakeholder trust).
  • Implementing these standards involves human-in-the-loop practices, specific methods like evaluating AI scoring accuracy, fairness/bias reviews, and documentation aligned with NIST AI RMF trustworthiness characteristics.

This chapter addresses Responsible AI (RAI) practices in AI-powered assessments, emphasizing test quality and equity using the Duolingo English Test (DET) as a case study. It presents the DET RAI standards, their development, and their alignment with the National Institute of Standards and Technology's (NIST) Artificial Intelligence Risk Management Framework (AI RMF).

The paper highlights the use of AI in assessments for automated scoring and item generation, while also acknowledging the risks of bias in AI-generated content. It advocates for aligning AI-powered assessments with human-centered AI values through responsible AI guidelines and standards. The paper emphasizes that while traditional assessment standards address AI to some extent, the expanded use of AI requires more comprehensive RAI standards to mitigate risks to test validity.

The authors define test quality as the suitability of an assessment for its intended purpose, supported by evidence gathering. Test equity is defined as fairness in test scores, ensuring no group is favored or disadvantaged. Argument-based test validity theory supports these principles through inferences such as domain definition, evaluation, generalization, explanation, extrapolation, and utilization.

The chapter uses the DET as a case study, describing it as a digital-first, high-stakes English language assessment that employs AI for item generation, writing and speaking evaluation, and plagiarism detection. The DET also uses human-in-the-loop (HiTL) AI practices, ensuring human oversight at critical decision points.

The DET's Responsible AI standards include four key principles:

  • Validity and Reliability: Ensuring the test is suitable for its intended purpose, evaluating construct relevance and scoring accuracy, and ensuring scoring consistency.
  • Fairness: Promoting democratization and social justice through increased access, accommodations, and inclusion; ensuring representative test-taker demographics; and avoiding biased algorithms.
  • Privacy and Security: Ensuring compliance with data protection laws, test-taker privacy, and secure test administration.
  • Accountability and Transparency: Gaining stakeholder trust through proper governance and documentation of AI use.

The development of these standards involved a literature review of AI ethical principles, validation against assessment-specific standards, consultation with experts, and publication as a living document open for public comment.

The paper validates the DET RAI standards against the NIST AI RMF, mapping the DET's ethical principles onto NIST's trustworthiness characteristics: valid and reliable; safe; secure and resilient; accountable and transparent; explainable and interpretable; privacy-enhanced; and fair, with harmful bias managed.

The authors illustrate the application of the DET RAI standards through practices associated with their goals, using the Interactive Reading and Writing Sample tasks as examples. They outline a six-step RAI process for DET task design, addressing aspects of measurement and security.

Validity and Reliability

The Validity and Reliability Standard sets goals for specifying the processes used to build a validity argument and for evaluating the AI used in test item creation, calibration, and scoring. Subgoals include:

  • Developing a description for the test target domain to ensure test items align with the measured domain.
    • This involves human subject-matter experts (SMEs) articulating the target construct (e.g., academic reading) and specifying tasks and scoring systems.
    • Passages generated for the Interactive Reading task include expository and narrative texts, and reading for orientation and information is operationalized through specific item types.
  • Evaluating AI scoring system accuracy and fairness, leveraging human expertise.
    • This is exemplified in the Writing Sample task, where automated writing evaluation (AWE) is used to score responses.
    • Human experts develop rubrics consistent with the Common European Framework of Reference (CEFR) levels, and agreement rates between human raters are monitored (a minimal agreement check is sketched after this list).
  • Developing explainable scoring methods and interpretable AI features that align with domain constructs.
    • The DET uses AI models to score open-ended responses, with human experts identifying features to be included in the AI model, drawing from Natural Language Processing (NLP), linguistics, and AWE literature.
    • Features for the Writing Sample task include inverse document frequency (IDF) weighted words, sentence overlap, coreference counts, latent semantic analysis (LSA), proportion of words by CEFR level, differential word use (DWU), and tree depth statistics.
    • SHapley Additive exPlanations (SHAP) are used to make scoring models explainable (see the SHAP sketch after this list).
  • Identifying AI methods for item creation, leveraging human expertise to efficiently create valid and reliable test items.
    • This covers managing automated item generation with GPT-4 to create large item banks.
    • Machine learning and assessment scientists collaborate to develop prompts that align with task specifications.
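
The chapter does not include the DET's internal evaluation code; the sketch below shows one way such an agreement check might look, assuming rubric scores on a shared ordinal scale. The function, example scores, and review threshold are hypothetical, and quadratically weighted kappa is used here as one common agreement statistic for ordinal rubric scores.

```python
# Minimal sketch (not DET code): monitoring agreement between two sets of
# rubric scores, e.g. two human raters or a human rater and the AWE model.
# The threshold and example scores are hypothetical.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def agreement_report(scores_a, scores_b, kappa_threshold=0.7):
    """Summarize agreement on an ordinal rubric scale (e.g., CEFR-aligned bands)."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    # Quadratically weighted kappa penalizes large disagreements more heavily,
    # a common choice for ordinal rubric scores.
    qwk = cohen_kappa_score(a, b, weights="quadratic")
    return {
        "qwk": qwk,
        "exact_agreement": float(np.mean(a == b)),
        "adjacent_agreement": float(np.mean(np.abs(a - b) <= 1)),
        "flag_for_review": qwk < kappa_threshold,  # escalate to human experts
    }

# Hypothetical scores on a 0-5 scale for ten responses.
print(agreement_report([3, 4, 2, 5, 3, 1, 4, 3, 2, 5],
                       [3, 4, 3, 5, 3, 1, 4, 2, 2, 5]))
```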

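To make the explainability point concrete, here is a minimal, self-contained sketch of applying SHAP to a feature-based scoring model. The feature names echo those listed above, but the data and model are synthetic stand-ins, not the DET's actual Writing Sample scorer.

```python
# Minimal sketch (not the DET model): explaining a feature-based scoring model
# with SHAP. The feature names echo the chapter's list, but the data and model
# here are synthetic stand-ins.
import numpy as np
import pandas as pd
import shap
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
feature_names = ["idf_weighted_words", "sentence_overlap", "coref_counts",
                 "lsa_similarity", "prop_cefr_c1_words", "tree_depth_mean"]
X = pd.DataFrame(rng.normal(size=(500, len(feature_names))), columns=feature_names)
# Synthetic scores driven mostly by two features, purely for illustration.
y = 2.0 * X["idf_weighted_words"] + X["prop_cefr_c1_words"] + rng.normal(0.0, 0.3, 500)

model = GradientBoostingRegressor().fit(X, y)

# SHAP attributes each prediction to the input features, so content experts can
# check that the model relies on construct-relevant evidence.
explainer = shap.Explainer(model, X)
explanation = explainer(X.iloc[:50])
print(pd.DataFrame(explanation.values, columns=feature_names).abs().mean())
```
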
Fairness

The Fairness Standard addresses test equity, aiming to ensure equal opportunity and AI free of algorithmic bias. Goals include specifying how AI facilitates test-taker access, accessibility, and inclusion, and specifying test-taker demographic representation as well as algorithms known to contain or generate bias. Subgoals include:

  • Developing and applying fairness and bias item review principles for inclusion that eliminate construct-irrelevant barriers.
    • Humans review content and tasks to identify sensitive content and low-quality items, improving the item design process and prompt development based on human review and feedback.
    • Fairness and bias (FAB) review ensures items are factually accurate and do not contain culturally sensitive topics.
  • Evaluating and documenting demographic representation in datasets used to build AI.
    • Writing Sample responses are sampled to include a roughly equal number of men and women from different L1 backgrounds (a representation check is sketched after this list).
  • Evaluating and documenting bias associated with automatically-generated item content and proficiency measurement.
    • Differential Rater Functioning (DRF) analysis is used on all scorers for open-ended tasks, and differential item functioning (DIF) analysis is used to detect bias at the item level (a DIF screening sketch follows this list).
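
As a rough illustration of the documentation step, the sketch below cross-tabulates group membership in a hypothetical sample-metadata table; the column names and group labels are assumptions, not the DET's schema.

```python
# Minimal sketch (hypothetical schema): cross-tabulating group membership in a
# sample used to build a scoring model, for documentation purposes.
import pandas as pd

def representation_table(df, group_cols=("gender", "l1_background")):
    """Report counts and proportions for each demographic cell."""
    counts = df.groupby(list(group_cols)).size().rename("n").reset_index()
    counts["proportion"] = counts["n"] / counts["n"].sum()
    return counts

sample = pd.DataFrame({
    "gender": ["man", "woman", "woman", "man", "woman", "man"],
    "l1_background": ["Mandarin", "Mandarin", "Spanish", "Spanish", "Hindi", "Hindi"],
})
print(representation_table(sample))
```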

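The chapter does not specify the DET's exact DIF procedure; the sketch below uses the common logistic-regression DIF screen (an ability measure, here a stand-in for total score, plus group and interaction terms) on synthetic data for a single item.

```python
# Minimal sketch (synthetic data): logistic-regression DIF screening for one
# item. In practice the matching variable would be the total test score; the
# chapter does not specify the DET's exact DIF procedure.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 2000
ability = rng.normal(size=n)                    # stand-in for total score
group = rng.integers(0, 2, size=n)              # 0 = reference, 1 = focal group
# Simulate responses to one item with a small uniform DIF effect built in.
p_correct = 1.0 / (1.0 + np.exp(-(1.2 * ability - 0.3 * group)))
df = pd.DataFrame({"item": rng.binomial(1, p_correct),
                   "ability": ability, "group": group})

base = smf.logit("item ~ ability", df).fit(disp=0)
dif = smf.logit("item ~ ability + group + ability:group", df).fit(disp=0)
# A likelihood-ratio test on the group terms flags the item for expert review.
lr_stat = 2.0 * (dif.llf - base.llf)            # roughly chi-square with 2 df
print(f"LR statistic: {lr_stat:.2f}")
```
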
Privacy and Security

The Privacy and Security Standard ensures secure, fair, and reliable test administration while protecting test-taker privacy and preventing cheating. Goals include specifying methods to ensure the privacy and security of data, maintaining test-taker privacy and security, and specifying fair and reliable test security and proctoring protocols. Subgoals include:

  • Defining, documenting, and implementing human-in-the-loop AI proctoring protocols that fairly and reliably identify novel and known cheating behaviors.
    • AI is used to compare test-taker writing responses to a database of relevant Internet content and writing responses from historical test sessions, with matches flagged and shown to proctors, as sketched below.
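
The DET's detection pipeline is not published, so the sketch below only illustrates the flag-and-review pattern, using TF-IDF cosine similarity against a toy reference corpus with a hypothetical threshold.

```python
# Minimal sketch (toy data): flagging writing responses that closely match a
# reference corpus so a human proctor can review them. The similarity measure
# and threshold are illustrative, not the DET's actual detection pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_corpus = [
    "The industrial revolution transformed urban labor markets.",
    "Climate change is driven largely by greenhouse gas emissions.",
]
responses = [
    "The industrial revolution transformed urban labor markets in many ways.",
    "My favorite holiday memory is visiting my grandparents by the sea.",
]

vectorizer = TfidfVectorizer().fit(reference_corpus + responses)
similarities = cosine_similarity(vectorizer.transform(responses),
                                 vectorizer.transform(reference_corpus))

THRESHOLD = 0.7  # hypothetical cutoff; real systems tune this empirically
for i, row in enumerate(similarities):
    if row.max() >= THRESHOLD:
        print(f"Response {i} flagged for proctor review (similarity {row.max():.2f})")
```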

Accountability and Transparency

The Accountability and Transparency Standard seeks to build trust with stakeholders through documentation and dissemination of AI use. Goals include assessing how AI processes impact stakeholders, documenting how AI is used for building the validity argument, and disseminating research about AI use to various stakeholder communities. Subgoals include:

  • Documenting external factors that result in a need to modify AI.
    • The Analytics for Quality Assurance in Assessment (AQuAA) system provides weekly reports on relevant metrics that reflect the quality and comparability of test scores over time (a drift-monitoring sketch follows this list).
  • Documenting AI used for building the validity argument, test item creation, calibration, and scoring.
    • The DET documents and controls AI use through its Exam Change Proposal (ECP) process.
  • Disseminating research about the use of AI to various stakeholder communities.
    • DET researchers disseminate research in the form of blog posts, white papers, and peer-reviewed articles.
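
As a rough sketch of this kind of monitoring, the example below tracks a weekly mean-score metric against a hypothetical baseline and flags drift for human review; AQuAA's actual metrics and control limits are not reproduced here.

```python
# Minimal sketch (synthetic data): tracking a weekly score metric and flagging
# drift for human review. The metric, baseline, and control limits are
# hypothetical, not AQuAA's actual dashboard.
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
weeks = pd.date_range("2024-01-01", periods=12, freq="W")
weekly_mean_score = pd.Series(100 + rng.normal(scale=1.0, size=12), index=weeks)
weekly_mean_score.iloc[-3:] += 4.0   # inject a drift so the check fires

baseline_mean, baseline_sd = 100.0, 1.0
# Flag any week whose mean drifts more than three baseline SDs, prompting
# human review of scoring, item pool, or test-taker population changes.
z = (weekly_mean_score - baseline_mean) / baseline_sd
flagged = weekly_mean_score[np.abs(z) > 3]
print(flagged if not flagged.empty else "No weeks flagged this period.")
```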

The paper concludes by discussing limitations and future work, including addressing new AI advances like GPT-4o, making fairness a cross-standard narrative, and including additional RAI standards covering environmental and labor impacts.