- The paper demonstrates that automating bias audits with the ITACA_144 tool streamlines compliance while exposing challenges in data relevance and fairness metrics.
- It reveals that provisions in the current regulation, such as permitting outdated data and excluding small demographic groups, hinder effective bias detection in AI hiring.
- The study recommends a comprehensive audit approach covering the entire AI system lifecycle to better identify and mitigate effective bias in recruitment processes.
This paper discusses the practical challenges encountered and lessons learned while developing software to automate bias audits for AI hiring systems in compliance with New York City's Local Law 144. The authors, from Eticas.ai, created a tool called ITACA_144, derived from their more comprehensive ITACA_OS platform, to streamline the legally mandated bias audits for Automated Employment Decision Tools (AEDTs).
Local Law 144 requires employers using AEDTs to conduct annual independent bias audits. Eticas.ai's automation effort aimed to make compliance more affordable and transform it into an opportunity for system optimization by identifying and minimizing errors. However, the process highlighted several shortcomings in the law's current formulation and its practical application.
Key learnings and recommendations include:
- Data Requirements: The law lacks specific requirements for the data used in audits, such as its recency or geographical relevance. Audits might use outdated or non-local data.
- Recommendation: Mandate the use of historical data from the last 12 months and ensure it pertains specifically to NYC-relevant hiring processes to reduce temporal and deployment bias.
- Demographic Inclusiveness: The law permits excluding demographic categories that represent less than 2% of the audit dataset. This provision often leads to the exclusion of groups such as American Indian, Alaska Native, Native Hawaiian, Pacific Islander, and others, potentially overlooking bias against precisely these vulnerable populations (the first sketch after this list shows how the carve-out plays out on a typical applicant pool).
- Recommendation: Remove the 2% exclusion rule and provide clearer definitions for categories like "Some Other Race".
- Impact Ratio vs. Fairness: Law 144 mandates calculating the Impact Ratio (IR), which compares selection rates across demographic groups. The IR can be computed in production, where ground-truth labels are unavailable, but on its own it is insufficient to establish fairness: a system can produce proportional outcomes (similar selection rates) while still discriminating through proxies or differential treatment. A worked IR calculation appears after this list.
- Recommendation: Acknowledge the limitations of IR and consider incorporating deeper analyses, such as counterfactual fairness checks, to determine whether the model treats individuals equally rather than merely producing proportional outcomes (a minimal counterfactual check is sketched after this list). Clarify whether the law's goal is proportional representation or equal treatment.
- Effective Bias: Focusing solely on model metrics overlooks biases introduced before the model (e.g., biased training data, data curation) and after the model's output (e.g., human decisions in the hiring pipeline). Assessing only the AEDT provides a partial view and cannot capture the effective bias of the entire process.
- Recommendation: Require documentation and transparency throughout the AI system's lifecycle. Audits should ideally capture data from pre-processing, in-processing, and post-processing stages to identify where bias originates and how interventions affect outcomes.
- Metrics: The law references the four-fifths (80%) rule, a common threshold for an acceptable IR, but doesn't require any action when a system falls outside that range. Furthermore, other metrics could provide more meaningful guidance.
- Recommendation: Use benchmarks based on representativity (e.g., comparing hiring demographics to census data, as sketched after this list) and focus on improving representation relative to the bias already present in the input data. Policymakers should define clear, enforceable metrics that guide developers toward genuinely fairer systems.
- Data Reliability: Audits currently depend on data provided by the entity being audited, creating potential for misrepresentation.
- Recommendation: Regulators should implement random, in-depth spot checks, potentially involving executing the system, to verify the submitted data and deter dishonest reporting, similar to practices in other regulated sectors.
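To make the 2% carve-out concrete, the sketch below applies it to a hypothetical applicant pool. The counts are invented for illustration and do not come from the paper or any real audit; they simply show how the categories named above tend to fall below the threshold and can legally be dropped from the analysis.

```python
import pandas as pd

# Invented applicant counts per demographic category (illustrative only).
counts = pd.Series({
    "White": 5200,
    "Black or African American": 2100,
    "Hispanic or Latino": 1900,
    "Asian": 700,
    "Two or More Races": 150,
    "Some Other Race": 80,
    "American Indian or Alaska Native": 45,
    "Native Hawaiian or Pacific Islander": 30,
})

shares = counts / counts.sum()          # share of each category in the audit dataset
excludable = shares[shares < 0.02]      # categories the 2% provision allows auditors to drop

print(shares.round(3))
print("Droppable under the 2% rule:", list(excludable.index))
```

On this pool, "Two or More Races", "Some Other Race", "American Indian or Alaska Native", and "Native Hawaiian or Pacific Islander" all fall below 2% and could be excluded, which is exactly the concern the authors raise.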
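The IR calculation itself is straightforward, which is part of the authors' point: it can be computed from selection outcomes alone, without ground-truth labels. Below is a minimal sketch, assuming a per-applicant table with a demographic category and a binary selection outcome; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical audit table: one row per applicant scored by the AEDT.
applicants = pd.DataFrame({
    "category": ["White", "White", "White", "White", "Black", "Black",
                 "Black", "Hispanic", "Hispanic", "Hispanic"],
    "selected": [1, 1, 0, 1, 1, 0, 0, 0, 1, 0],
})

# Selection rate per category: fraction of that category the tool advanced.
selection_rates = applicants.groupby("category")["selected"].mean()

# Impact Ratio: each category's selection rate relative to the
# most-selected category.
impact_ratios = selection_rates / selection_rates.max()

# Four-fifths (80%) check: categories below 0.8 would be flagged,
# although Law 144 does not require any corrective action.
flagged = impact_ratios[impact_ratios < 0.8]

print(impact_ratios.round(2))
print("Below the 80% threshold:", list(flagged.index))
```

Note that the calculation says nothing about why the rates differ or whether individual candidates were treated consistently, which is the limitation discussed above.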
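One way to probe equal treatment, as referenced in the recommendation on counterfactual fairness checks, is to score the same candidate twice, changing only a protected attribute, and compare the results. The sketch below uses a deliberately biased toy scoring function as a stand-in for an AEDT; the function, attribute names, and bias term are all hypothetical, and a real audit would call the vendor's model instead.

```python
def toy_score(candidate):
    # Deliberately biased toy stand-in for an AEDT scoring function.
    score = 0.5 + 0.1 * candidate["years_experience"]
    if candidate["gender"] == "female":
        score -= 0.05
    return score

def counterfactual_gap(score_fn, candidate, attribute, alternative):
    """Score the same candidate with only `attribute` changed and return the difference."""
    original = score_fn(candidate)
    flipped = {**candidate, attribute: alternative}
    return score_fn(flipped) - original

candidate = {"years_experience": 3, "gender": "female"}
gap = counterfactual_gap(toy_score, candidate, "gender", "male")

# A nonzero gap indicates differential treatment of otherwise identical
# candidates, even if aggregate selection rates look proportional.
print(f"Counterfactual score gap: {gap:+.2f}")
```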
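As a sketch of the representativity benchmark suggested above, the comparison below divides the demographic shares of hired candidates by reference population shares; all numbers are invented for illustration and are not actual NYC census or hiring figures.

```python
import pandas as pd

# Illustrative shares only (not real census or hiring data).
reference_share = pd.Series({"White": 0.40, "Black": 0.22,
                             "Hispanic": 0.28, "Asian": 0.10})
hired_share = pd.Series({"White": 0.55, "Black": 0.14,
                         "Hispanic": 0.19, "Asian": 0.12})

# Representativity ratio: >1 means over-representation relative to the
# reference population, <1 means under-representation.
representativity = (hired_share / reference_share).round(2)
print(representativity)
```

Tracking this ratio against the applicant pool over time would indicate whether the system improves on, preserves, or amplifies the bias already present in its inputs.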
In conclusion, while NYC Local Law 144 is a significant step towards AI accountability in hiring, its current limitations may lead to compliance exercises that don't genuinely improve fairness or safety. The authors advocate for refining the regulatory requirements based on practical auditing experiences to ensure that bias measurement standards are robust, useful, and promote meaningful improvements in AI systems.