Sample size requirements for machine learning versus regression to achieve comparable stability

Determine whether tree-based machine learning methods, such as random forests, require substantially larger development sample sizes than penalised or unpenalised logistic regression to achieve comparable stability of individual-level risk estimates, and quantify the extent of any sample size differences.

Background

The authors’ approach is grounded in maximum likelihood theory for logistic regression and yields closed-form uncertainty estimates. They compare empirical uncertainty intervals from random forests and logistic regression in supplementary analyses and observe wider intervals and potential miscalibration for random forests under default settings.
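
To make the contrast concrete, the following is a minimal, hypothetical Python sketch (not the authors' supplementary code) of the comparison described above. For unpenalised logistic regression, a 95% uncertainty interval for each individual's risk follows in closed form from the inverse Fisher information, with variance of the linear predictor x_i' I(beta-hat)^{-1} x_i; for the random forest, bootstrap percentile intervals stand in for the empirical intervals. The synthetic data, scikit-learn estimators, and bootstrap settings are illustrative assumptions only.

```python
# Hypothetical sketch: closed-form uncertainty for logistic regression via the
# inverse Fisher information, versus bootstrap instability intervals for a
# random forest's individual risk estimates. Data and settings are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X1 = np.column_stack([np.ones(len(X)), X])        # design matrix with intercept

# --- Logistic regression: closed-form SE of each linear predictor ---
lr = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)   # unpenalised ML (scikit-learn >= 1.2)
beta = np.concatenate([lr.intercept_, lr.coef_.ravel()])
p_hat = 1.0 / (1.0 + np.exp(-X1 @ beta))
W = p_hat * (1.0 - p_hat)                         # Fisher weights p(1 - p)
info = X1.T @ (X1 * W[:, None])                   # Fisher information matrix
info_inv = np.linalg.inv(info)
se_lp = np.sqrt(np.einsum("ij,jk,ik->i", X1, info_inv, X1))
lr_lo = 1.0 / (1.0 + np.exp(-(X1 @ beta - 1.96 * se_lp)))
lr_hi = 1.0 / (1.0 + np.exp(-(X1 @ beta + 1.96 * se_lp)))
lr_width = lr_hi - lr_lo                          # 95% interval width per individual

# --- Random forest: empirical (bootstrap percentile) instability intervals ---
B = 100
rf_preds = np.empty((B, len(X)))
for b in range(B):
    idx = rng.integers(0, len(X), len(X))         # bootstrap resample of the data
    rf = RandomForestClassifier(n_estimators=100, random_state=b)  # default settings
    rf.fit(X[idx], y[idx])
    rf_preds[b] = rf.predict_proba(X)[:, 1]
rf_lo, rf_hi = np.percentile(rf_preds, [2.5, 97.5], axis=0)
rf_width = rf_hi - rf_lo

print(f"median 95% interval width: LR {np.median(lr_width):.3f}, "
      f"RF {np.median(rf_width):.3f}")
```

Under this kind of comparison, wider median interval widths for the random forest at a given sample size would be consistent with the hypothesised extra data requirement; repeating it over a grid of sample sizes would indicate how much larger a sample the forest needs to match the regression's stability.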

They hypothesize that non-regression machine learning methods may require more data to achieve the same level of stability, but explicitly state that this needs substantiation through further research.

References

Other machine learning approaches, such as tree-based methods, may need substantially higher sample sizes to achieve the same level of stability compared to (penalised) regression approaches. Further research is needed to substantiate this, but an initial investigation is provided in supplementary material S5 for our two examples.

A decomposition of Fisher's information to inform sample size for developing fair and precise clinical prediction models -- part 1: binary outcomes (2407.09293 - Riley et al., 12 Jul 2024) in Section 6 (Discussion); see also Supplementary Material S5