- The paper demonstrates that underspecification leads to multiple models performing equally on training data yet behaving differently in deployment scenarios.
- It presents empirical evidence across domains like computer vision, NLP, and medical imaging, highlighting significant performance variations on stress tests.
- The paper recommends enforcing domain-specific inductive biases to improve model robustness and credibility without sacrificing in-domain accuracy.
Underspecification Presents Challenges for Credibility in Modern Machine Learning
The paper, "Underspecification Presents Challenges for Credibility in Modern Machine Learning," explores the notion of underspecification in ML pipelines and its implications for model performance, particularly in deployment scenarios. Underspecification arises when there are multiple predictors that yield equivalent performance on training data but differ in their behavior when applied to real-world tasks. This issue is shown to be pervasive in modern ML, particularly in deep learning applications, where the complexity of models creates a substantial equivalence class of solutions not differentiated by training data alone.
Key Findings
- Nature of Underspecification: An underspecified ML pipeline can produce many models that perform equally well on the training distribution yet differ because of arbitrary factors such as random initialization and the optimization path. The phenomenon is particularly prominent in overparameterized models like neural networks, where there are more parameters than training samples (see the sketch after this list).
- Impact on Deployment: The paper demonstrates that predictors with near-identical in-sample performance can behave very differently once deployed. This is distinct from the usual framing of distribution shift, where training and deployment distributions simply differ; here the problem is that the training pipeline leaves deployment-relevant behavior unconstrained, so arbitrary choices such as the random seed determine which of the equally good predictors is actually returned.
- Empirical Evidence: Through case studies in computer vision, medical imaging, natural language processing, and electronic health records, the paper documents underspecification empirically. In computer vision, for example, models that differed only in random initialization showed markedly different performance on robustness benchmarks such as ImageNet-C, despite near-identical accuracy on the standard ImageNet test set.
- Stress Testing: The authors advocate for using stress tests as a means to evaluate out-of-distribution performance explicitly. These tests reveal dimensions along which models may not generalize, thus highlighting gaps in the learning process not apparent through standard evaluation methods.
- Recommendations: To tackle underspecification, the authors suggest developing methods that enforce credible, application-specific inductive biases. Importantly, they argue that this need not compromise in-distribution performance, since credible models already lie within the equivalence class of equally performant solutions.
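The seed sensitivity described under "Nature of Underspecification" can be reproduced in a few lines. The following is a minimal sketch, not code from the paper: the synthetic dataset, the small MLP, and the Gaussian-noise "stress test" are all illustrative stand-ins. Typically the i.i.d. test accuracies cluster tightly across seeds while the stress-test accuracies spread more noticeably, which is the signature of underspecification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Fixed dataset: every run below sees exactly the same training data.
X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A crude "stress test": feature noise the training distribution never contained.
rng = np.random.default_rng(0)
X_stress = X_test + rng.normal(scale=1.5, size=X_test.shape)

for seed in range(5):
    # Only the random seed (initialization / data shuffling) changes between runs.
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500,
                        random_state=seed)
    clf.fit(X_train, y_train)
    print(f"seed={seed}  iid acc={clf.score(X_test, y_test):.3f}  "
          f"stress acc={clf.score(X_stress, y_test):.3f}")
```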
Implications
- Practical: For practitioners, the findings imply the necessity of rigorous, application-specific stress testing to uncover weaknesses in deployed models that standard evaluations do not capture. Models should be tested across a variety of conditions mimicking potential deployment scenarios; a minimal harness along these lines is sketched after this list.
- Theoretical: From a theoretical standpoint, the notion of underspecification challenges the assumption that training performance is a complete proxy for model quality. It adds a layer of complexity to model selection and design, suggesting a need for frameworks that can incorporate domain-specific requirements beyond iid performance.
- Future Research: The paper opens pathways for further research into methods that explicitly constrain the inductive biases of ML models to align with real-world expectations. This could involve combining causal inference methods with traditional ML approaches to ensure models learn the desired causal structures.
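As a concrete illustration of the practical implication above, here is a hedged sketch of a small stress-test harness. Everything in it is an assumption rather than the paper's methodology: the function name `run_stress_suite`, the perturbation names, and the scikit-learn-style `model.score` interface are illustrative choices; a real suite would encode perturbations specific to the application domain.

```python
from typing import Callable, Dict
import numpy as np

def run_stress_suite(model, X, y,
                     perturbations: Dict[str, Callable[[np.ndarray], np.ndarray]]
                     ) -> Dict[str, float]:
    """Score a fitted model on the clean data and on each named perturbation."""
    results = {"baseline": model.score(X, y)}
    for name, perturb in perturbations.items():
        results[name] = model.score(perturb(X), y)
    return results

# Example perturbations standing in for hypothetical deployment conditions.
rng = np.random.default_rng(0)
suite = {
    "gaussian_noise":  lambda X: X + rng.normal(scale=1.0, size=X.shape),
    "feature_dropout": lambda X: X * rng.binomial(1, 0.9, size=X.shape),
    "scale_shift":     lambda X: X * 1.5,
}

# Usage, assuming `clf`, `X_test`, `y_test` from a prior fit (e.g. the sketch above):
# print(run_stress_suite(clf, X_test, y_test, suite))
```

Reporting the full dictionary of scores, rather than a single aggregate number, keeps visible exactly which deployment-relevant dimension a given model fails on.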
Conclusion
Underspecification poses a significant challenge to the credibility of ML models in real-world deployments. It requires a departure from current practices that rely heavily on training data performance, advocating instead for more comprehensive evaluation frameworks that account for model behavior under varied conditions. By addressing underspecification, ML practitioners can enhance the robustness and reliability of models, ensuring they meet the nuanced demands of their application domains.