
Outcome-based Evaluation of Systematic Review Automation (2306.17614v1)

Published 30 Jun 2023 in cs.IR

Abstract: Current methods of evaluating search strategies and automated citation screening for systematic literature reviews typically rely on counting the number of relevant and non-relevant publications. This established practice, however, does not accurately reflect the reality of conducting a systematic review, because not all included publications have the same influence on the final outcome of the systematic review. More specifically, if an important publication gets excluded or included, this might significantly change the overall review outcome, while not including or excluding less influential studies may only have a limited impact. However, in terms of evaluation measures, all inclusion and exclusion decisions are treated equally and, therefore, failing to retrieve publications with little to no impact on the review outcome leads to the same decrease in recall as failing to retrieve crucial publications. We propose a new evaluation framework that takes into account the impact of the reported study on the overall systematic review outcome. We demonstrate the framework by extracting review meta-analysis data and estimating outcome effects using predictions from ranking runs on systematic reviews of interventions from the CLEF TAR 2019 shared task. We further measure how close the obtained outcomes are to the outcomes of the original review if the arbitrary rankings were used. We evaluate 74 runs using the proposed framework and compare the results with those obtained using standard IR measures. We find that accounting for the difference in review outcomes leads to a different assessment of the quality of a system than if traditional evaluation measures were used. Our analysis provides new insights into the evaluation of retrieval results in the context of systematic review automation, emphasising the importance of assessing the usefulness of each document beyond binary relevance.
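
The core idea can be illustrated with a minimal sketch: re-estimate the review's pooled effect from only the studies a retrieval run actually surfaces, and compare it with the estimate obtained from every included study. The snippet below uses a fixed-effect inverse-variance meta-analysis; the function names, the data layout, and the use of an absolute difference as the error score are illustrative assumptions, not the authors' actual implementation or metric.

```python
def pooled_effect(effects, ses):
    """Fixed-effect (inverse-variance) pooled estimate.

    effects: per-study effect sizes (e.g. log odds ratios)
    ses:     corresponding standard errors
    """
    weights = [1.0 / se ** 2 for se in ses]
    return sum(w * e for w, e in zip(weights, effects)) / sum(weights)


def outcome_error(all_studies, retrieved_ids):
    """Absolute difference between the review outcome computed from every
    included study and the outcome computed only from the studies that a
    retrieval run surfaced (illustrative score, not the paper's exact measure)."""
    full = pooled_effect([s["effect"] for s in all_studies],
                         [s["se"] for s in all_studies])
    found = [s for s in all_studies if s["id"] in retrieved_ids]
    if not found:
        return float("inf")  # nothing retrieved: the outcome cannot be reproduced
    partial = pooled_effect([s["effect"] for s in found],
                            [s["se"] for s in found])
    return abs(full - partial)


# Toy data (hypothetical): missing the precise study B changes the outcome far
# more than missing the small study C, although recall penalises both equally.
studies = [
    {"id": "A", "effect": -0.40, "se": 0.20},
    {"id": "B", "effect": -0.90, "se": 0.10},  # large, influential study
    {"id": "C", "effect": -0.35, "se": 0.50},  # small study, little influence
]
print(outcome_error(studies, {"A", "C"}))  # large error: B was missed
print(outcome_error(studies, {"A", "B"}))  # small error: only C was missed
```

In this toy example, omitting study B shifts the pooled estimate by roughly 0.39 while omitting study C shifts it by about 0.01, even though recall treats both misses identically; that asymmetry is precisely what an outcome-based evaluation is meant to capture.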

Authors (4)
  1. Wojciech Kusa (16 papers)
  2. Guido Zuccon (73 papers)
  3. Petr Knoth (19 papers)
  4. Allan Hanbury (45 papers)
Citations (7)