Learning to Complement with Multiple Humans (2311.13172v2)
Abstract: Real-world image classification tasks tend to be complex, and expert labellers are sometimes unsure about the classes present in the images, leading to the problem of learning with noisy labels (LNL). The ill-posedness of the LNL task requires the adoption of strong assumptions or the use of multiple noisy labels per training image, resulting in accurate models that work well in isolation but fail to optimise human-AI collaborative classification (HAI-CC). Unlike such LNL methods, HAI-CC aims to leverage the synergies between human expertise and AI capabilities, but existing HAI-CC methods require clean training labels, limiting their real-world applicability. This paper addresses this gap by introducing the Learning to Complement with Multiple Humans (LECOMH) approach. LECOMH is designed to learn from noisy labels without depending on clean labels, simultaneously maximising collaborative accuracy while minimising the cost of human collaboration, measured by the number of human expert annotations required per image. Additionally, new benchmarks featuring multiple noisy labels for both training and testing are proposed to evaluate HAI-CC methods. In quantitative comparisons on these benchmarks, LECOMH consistently outperforms competitive HAI-CC approaches, human labellers, multi-rater learning, and noisy-label learning methods across various datasets, offering a promising solution to real-world image classification challenges.
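The trade-off the abstract describes, maximising collaborative accuracy while penalising the number of human annotations requested per image, can be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration only: the names `CollaborationSelector` and `collaboration_loss`, the Gumbel-softmax mechanism used to make the discrete query decision differentiable, and the `cost_weight` trade-off hyperparameter are not taken from the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CollaborationSelector(nn.Module):
    """Hypothetical selector: given image features, decide how many human
    annotations (0 .. max_humans) to combine with the AI prediction."""

    def __init__(self, feat_dim: int, max_humans: int):
        super().__init__()
        # One logit per option: query 0, 1, ..., max_humans annotators.
        self.head = nn.Linear(feat_dim, max_humans + 1)

    def forward(self, feats: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        logits = self.head(feats)
        # Differentiable discrete choice via straight-through Gumbel-softmax.
        return F.gumbel_softmax(logits, tau=tau, hard=True)


def collaboration_loss(collab_logits: torch.Tensor,
                       targets: torch.Tensor,
                       selection: torch.Tensor,
                       cost_weight: float = 0.1) -> torch.Tensor:
    """Accuracy term plus a penalty proportional to the expected number of
    human annotations requested per image."""
    ce = F.cross_entropy(collab_logits, targets)
    # selection is one-hot over {0, 1, ..., M} queried humans per image.
    num_options = selection.size(1)
    counts = torch.arange(num_options,
                          device=selection.device,
                          dtype=selection.dtype)
    expected_queries = (selection * counts).sum(dim=1).mean()
    return ce + cost_weight * expected_queries
```

In a complete system, `collab_logits` would come from combining the AI classifier's prediction with however many human labels the selector chose to query; that combination step is left abstract here, and `cost_weight` controls how strongly annotation cost is traded against accuracy.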