
Does AI help humans make better decisions? A statistical evaluation framework for experimental and observational studies (2403.12108v3)

Published 18 Mar 2024 in cs.AI, econ.GN, q-fin.EC, stat.AP, and stat.ME

Abstract: The use of AI, or more generally data-driven algorithms, has become ubiquitous in today's society. Yet, in many cases and especially when stakes are high, humans still make final decisions. The critical question, therefore, is whether AI helps humans make better decisions compared to a human-alone or AI-alone system. We introduce a new methodological framework to empirically answer this question with a minimal set of assumptions. We measure a decision maker's ability to make correct decisions using standard classification metrics based on the baseline potential outcome. We consider a single-blinded and unconfounded treatment assignment, where the provision of AI-generated recommendations is assumed to be randomized across cases with humans making final decisions. Under this study design, we show how to compare the performance of three alternative decision-making systems--human-alone, human-with-AI, and AI-alone. Importantly, the AI-alone system includes any individualized treatment assignment, including those that are not used in the original study. We also show when AI recommendations should be provided to a human decision-maker, and when one should follow such recommendations. We apply the proposed methodology to our own randomized controlled trial evaluating a pretrial risk assessment instrument. We find that the risk assessment recommendations do not improve the classification accuracy of a judge's decision to impose cash bail. Furthermore, we find that replacing a human judge with algorithms--the risk assessment score and a large language model (LLM) in particular--leads to a worse classification performance.

Evaluating the Impact of AI Recommendations on Human Decision-Making: Experimental Evidence from Pretrial Decisions

Introduction to the Methodological Framework and Experimental Design

The paper introduces a methodological framework for experimentally evaluating whether AI-generated recommendations improve human decision-making relative to decisions made by humans alone or by AI alone. A central difficulty is the selective labels problem: the outcomes of interest are observed only conditional on the decisions actually made, so, for example, a detained arrestee's behavior under release is never observed. Leveraging a single-blinded experimental design, the paper randomizes whether the AI recommendation is provided to the human decision-maker, so that the recommendation can affect outcomes only through its influence on the final human decision.
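To fix ideas, here is a minimal sketch of the comparison this design licenses, written against a hypothetical simulated dataset; the column names z, d, and y0 are illustrative assumptions, not the authors' replication code. Because provision of the recommendation is randomized, contrasting classification metrics across arms estimates the effect of supplying AI advice:

```python
import numpy as np
import pandas as pd

# Hypothetical column names (not taken from the authors' replication code):
#   z  : 1 if the AI recommendation was shown to the judge (randomized), else 0
#   d  : 1 if the judge imposed cash bail, else 0
#   y0 : baseline potential outcome (e.g., a new arrest absent cash bail).
# In real data y0 is censored for detained cases (the selective labels
# problem); here we assume it is fully known, as in a simulation.

def classification_metrics(d, y0):
    """Evaluate the decision d as a classifier of the baseline outcome y0."""
    d, y0 = np.asarray(d), np.asarray(y0)
    return {
        "accuracy": float(np.mean(d == y0)),
        "false_positive_rate": float(np.mean(d[y0 == 0] == 1)),
        "false_negative_rate": float(np.mean(d[y0 == 1] == 0)),
    }

def compare_arms(df: pd.DataFrame) -> pd.DataFrame:
    """Contrast human-alone (z == 0) with human-with-AI (z == 1) cases."""
    rows = {
        "human_alone": classification_metrics(
            df.loc[df.z == 0, "d"], df.loc[df.z == 0, "y0"]
        ),
        "human_with_ai": classification_metrics(
            df.loc[df.z == 1, "d"], df.loc[df.z == 1, "y0"]
        ),
    }
    return pd.DataFrame(rows)
```

In the actual data the baseline outcome y0 is censored whenever cash bail is imposed, so the paper replaces these naive point estimates with partial-identification bounds; the sketch above shows what those bounds collapse to when y0 is fully observed.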

The Experimental Context and Findings

The paper is grounded in a randomized controlled trial (RCT) assessing the impact of an AI-generated pretrial risk assessment, the Public Safety Assessment (PSA), on judges' choices between cash bail and a signature bond at a criminal first appearance hearing. Providing the PSA recommendations did not significantly improve the classification accuracy of judges' decisions. Moreover, decisions made by AI alone generally underperformed those involving human judgment, with or without AI input. Notably, the AI-alone decisions exhibited a substantial racial disparity: a higher false positive rate for non-white arrestees than for their white counterparts.
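The disparity finding can be made concrete with a hedged sketch in the same simulated setup as above: an AI-alone rule is formed by thresholding a risk score, and its false positive rate is computed separately by group. The column names (psa_score, y0, non_white) and the cutoff of 2 are assumptions for illustration, not the instrument's actual variables or decision threshold:

```python
import numpy as np
import pandas as pd

def false_positive_rate(decision, y0):
    """Among cases with no baseline incident (y0 == 0), the share given cash bail."""
    decision, y0 = np.asarray(decision), np.asarray(y0)
    return float(np.mean(decision[y0 == 0] == 1))

def ai_alone_fpr_by_race(df: pd.DataFrame, threshold: int = 2) -> dict:
    """FPR of a hypothetical AI-alone rule that imposes cash bail whenever the
    risk score reaches `threshold`, computed separately by racial group."""
    result = {}
    for is_non_white, grp in df.groupby("non_white"):
        ai_decision = (grp["psa_score"] >= threshold).astype(int)
        key = "non_white" if is_non_white == 1 else "white"
        result[key] = false_positive_rate(ai_decision, grp["y0"])
    return result
```

A gap between the two rates is precisely the kind of disparity the paper reports for AI-alone decisions.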

Implications of the Study

The outcomes of this research have both theoretical and practical significance. Theoretically, it highlights the intricate dynamics between human decision-makers and AI-based recommendations, challenging the assumption that AI integration naturally enhances decision accuracy. Practically, the findings signal to policymakers and practitioners the need for a cautious approach toward implementing AI in sensitive decision-making arenas like the judicial system. By revealing specific shortcomings in AI recommendations—particularly around racial disparities—the paper underscores the urgency for rigorous, context-specific evaluations before widespread deployment.

Future Directions in AI and Human Decision-Making Research

Looking forward, this paper lays a foundation for subsequent research paths that could explore various dimensions of AI-assisted decision-making. One potential avenue is extending the proposed methodological framework to non-binary decision-making settings, thereby expanding its applicability. Investigating the joint potential outcomes, rather than focusing solely on the baseline potential outcome, could also yield deeper insights into the nuanced impacts of AI on decision quality. Dynamic settings, where decisions and outcomes evolve over time, offer another rich context for future exploration. Lastly, the practical deployment of AI decision-making systems across different sectors presents an ongoing opportunity to refine and validate the framework introduced in this paper.

Conclusion

This research provides a methodologically robust, empirically grounded critique of the integration of AI recommendations into human decision-making processes, particularly within the judicial context. By systematically examining the influence of AI on human judgment through a carefully designed RCT, the paper offers valuable insights into the limitations and potential risks associated with AI assistance. It serves as a crucial reminder of the need for comprehensive evaluation and cautious implementation of AI technologies in decision-making processes that significantly affect human lives.

Authors (6)
  1. Eli Ben-Michael
  2. D. James Greiner
  3. Melody Huang
  4. Kosuke Imai
  5. Zhichao Jiang
  6. Sooahn Shin