Weak Baselines and Reporting Biases Lead to Overoptimism in Machine Learning for Fluid-Related PDEs
The paper under review, authored by McGreivy and Hakim, provides a critical analysis of the current literature on using ML to solve partial differential equations (PDEs) related to fluid mechanics. Its primary focus is identifying the causes of overoptimistic results in this domain, with weak baselines and reporting biases singled out as the major concerns.
Main Points
Reproducibility Crisis in ML-Based Science
The paper begins by situating its discussion within the broader reproducibility crisis affecting many scientific fields. It notes that ML and ML-based scientific research are not immune to these issues. This is further corroborated by large-scale analyses documenting reproducibility concerns in various subfields of ML, such as medical applications.
Scope and Methodology
The authors focus on fluid-related PDEs, analyzing research that employs ML to solve these equations more efficiently than traditional numerical methods. They consider 82 articles and identify common pitfalls like weak baselines and reporting biases.
Weak Baselines
Two primary rules are established for ensuring fair comparisons between ML-based solvers and traditional numerical methods:
- Rule 1: Comparisons at Equal Accuracy or Equal Runtime: Comparing speed only makes sense when both methods achieve the same accuracy (and comparing accuracy only makes sense at equal runtime). A common violation is to run the traditional method at a much higher accuracy than the ML solver attains, which inflates the apparent speedup of the ML approach.
- Rule 2: Compare to an Efficient Numerical Method: It's crucial to compare ML-based solvers to state-of-the-art, highly efficient numerical methods, rather than older or less efficient ones.
The paper reports that 79% of the reviewed studies violated at least one of these rules. For instance, many articles compared their ML-based solvers against outdated or suboptimal numerical baselines, thus overestimating the advantage of the ML approach.
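To make Rule 1 concrete, the following is a minimal sketch, not taken from the paper, of how an equal-accuracy comparison might be set up: the numerical baseline is swept over resolutions, and the coarsest resolution that matches the ML solver's error is the one used for the runtime comparison. The names `ml_solver`, `numerical_solver`, and `error_fn` are hypothetical placeholders for problem-specific code.

```python
import time

def measure(solver, *args):
    """Run a solver once and return (solution, wall-clock runtime in seconds)."""
    t0 = time.perf_counter()
    u = solver(*args)
    return u, time.perf_counter() - t0

def equal_accuracy_speedup(ml_solver, numerical_solver, resolutions, error_fn):
    """Estimate the speedup of an ML solver over a numerical baseline at equal accuracy.

    The numerical baseline is swept from coarse to fine resolution; the coarsest
    resolution whose error does not exceed the ML solver's error is taken as the
    fair comparison point (Rule 1). `error_fn(u)` returns the error of a solution
    measured against a trusted reference.
    """
    u_ml, t_ml = measure(ml_solver)
    err_ml = error_fn(u_ml)

    for n in sorted(resolutions):            # coarse -> fine
        u_num, t_num = measure(numerical_solver, n)
        if error_fn(u_num) <= err_ml:        # first resolution matching ML accuracy
            return {"ml_error": err_ml, "ml_runtime": t_ml,
                    "baseline_resolution": n, "baseline_runtime": t_num,
                    "speedup": t_num / t_ml}
    raise ValueError("Numerical baseline never reached the ML solver's accuracy.")
```

Rule 2 then amounts to ensuring that `numerical_solver` is a state-of-the-art method on appropriate hardware, rather than an outdated or deliberately coarse implementation.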
Reporting Biases
The analysis also identifies a pervasive bias towards reporting positive results. The paper finds that 94.8% of the reviewed articles reported only positive results, while just 5.2% reported both positive and negative results; none reported solely negative outcomes. This imbalance points to a significant publication bias, in which negative results are discouraged or omitted, inflating the perceived effectiveness of ML tools.
Statistical and Anecdotal Evidence
The paper presents both statistical and anecdotal evidence to support its claims. For example, some ML methods that show promising results in one context perform poorly when tested under different conditions. This is indicative of selective reporting and outcome switching, practices that further contribute to the reproducibility crisis.
Reproducing Results with Stronger Baselines
The authors attempted to replicate results from ten highly cited articles using stronger baselines. In most cases, the more efficient numerical methods outperformed the ML-based solvers. Notably, only three out of ten ML-based methods remained competitive when compared against these optimized traditional methods.
Implications and Future Directions
Practical Implications
The findings caution against overly optimistic assessments of ML for fluid-related PDEs. The tendency to use weak baselines and to report results selectively misleads subsequent research and applications, potentially leading to suboptimal solutions in real-world settings.
Theoretical Implications
The paper highlights the need for more rigorous standards in ML-based scientific research. Ensuring fair comparisons and complete reporting would help provide a more accurate picture of an ML model's true efficacy.
Recommendations for Best Practices
- Fair Comparisons: Compare ML-based solvers with both traditional numerical methods and other ML-based methods.
- Adherence to Rule 1: Always ensure comparisons are made at equal accuracy or runtime.
- Multiple Baselines: Employ multiple numerical methods where possible to ensure robust baselines.
- Transparency: Explicitly discuss how baselines were chosen and justify their efficiency.
- Report Efficiency Metrics: Besides runtime and accuracy, include the computational cost of generating training data and of training the model; a sketch of such an amortized accounting follows this list.
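As an illustration of the last recommendation, here is a minimal sketch, assumed rather than taken from the paper, of how the one-off costs of data generation and training could be amortized over the number of times the trained model is reused, so that a total cost per solve can be compared against a classical solver's per-solve runtime. All numbers in the example are made up.

```python
def amortized_cost_per_solve(data_generation_hours, training_hours,
                             inference_seconds, num_solves):
    """Total compute per solve for an ML surrogate, amortizing one-off costs.

    One-off costs (generating training data, training the model) are spread
    across the number of solves the trained model will actually be used for;
    the result can be compared directly with the per-solve runtime of a
    classical numerical method.
    """
    one_off_seconds = (data_generation_hours + training_hours) * 3600.0
    return one_off_seconds / num_solves + inference_seconds

# Example with made-up numbers: a surrogate that is fast at inference time may
# still be more expensive overall if it is reused only a handful of times.
cost_100 = amortized_cost_per_solve(20.0, 10.0, 0.5, num_solves=100)      # ~1080.5 s
cost_1e5 = amortized_cost_per_solve(20.0, 10.0, 0.5, num_solves=100_000)  # ~1.58 s
print(f"{cost_100:.1f} s per solve at 100 reuses, {cost_1e5:.2f} s at 100k reuses")
```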
Conclusion
The paper provides a nuanced and empirical critique of current ML practices in solving fluid-related PDEs. By identifying and quantifying issues around weak baselines and reporting biases, it sets a high bar for future research methodologies in this domain. The recommendations made by McGreivy and Hakim aim to foster a more rigorous and transparent approach, ultimately benefitting both the ML community and the broader scientific landscape.