Cause of test flakiness for SWE-bench Lite instance astropy__astropy-6938

Ascertain the exact cause of the non-deterministic behavior of the unit test suite for the SWE-bench Lite instance astropy__astropy-6938 that is flaky on some machines but not others, and determine the environment and dependency conditions under which this flakiness occurs.

Background

In analyzing SWE-bench Lite, Brown et al. identified widespread flakiness in test suites: 34 problems exhibited inconsistent pass/fail behavior, and for 30 of those the flakiness affected the dataset-provided correct solutions. One particular instance, astropy__astropy-6938, showed flakiness that varied by machine.
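As a rough illustration of what "inconsistent pass/fail behavior" means operationally, the sketch below flags an instance as flaky when repeated runs of the same solution disagree. The function names, the second instance ID, and the outcome data are hypothetical, invented for the example; they are not taken from the paper or from SWE-bench tooling.

```python
def is_flaky(outcomes):
    """Return True if repeated runs of the same test suite disagree."""
    return len(set(outcomes)) > 1


def flaky_instances(runs):
    """Return the sorted IDs of instances whose repeated runs disagree.

    `runs` maps an instance ID to a list of booleans (True = suite passed).
    """
    return sorted(
        instance for instance, outcomes in runs.items() if is_flaky(outcomes)
    )


# Hypothetical outcomes, for illustration only (not measured data).
runs = {
    "astropy__astropy-6938": [True, False, True],
    "hypothetical__instance-1": [True, True, True],
}
flaky = flaky_instances(runs)
```

Because the flakiness of astropy__astropy-6938 depended on the machine, a check like this would need to be repeated in each environment separately; a run that looks stable on one machine says nothing about another.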

Although the SWE-bench authors could reproduce the flakiness, Brown et al. could not, which suggests unresolved environment-specific factors. Their preliminary investigation points to unpinned dependency versions in the Docker environments used to run the unit tests, but the precise cause remains to be determined.
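One way to start narrowing down an unpinned-dependency hypothesis is to compare the installed package versions on a machine that reproduces the flakiness against one that does not. The following is a minimal sketch that assumes each environment's packages have been captured as `pip freeze`-style text (`name==version` lines). The helper names and the package versions shown are hypothetical and are not the actual versions from the SWE-bench Docker images.

```python
def parse_freeze(text):
    """Parse `pip freeze`-style lines ('name==version') into a dict."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not pinned with '=='.
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        versions[name.strip().lower()] = version.strip()
    return versions


def diff_environments(freeze_a, freeze_b):
    """Map each package whose version differs to (version_a, version_b).

    A package missing from one environment is reported with None.
    """
    a, b = parse_freeze(freeze_a), parse_freeze(freeze_b)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Hypothetical snapshots from two machines (versions are made up).
machine_a = "numpy==1.13.3\npytest==3.3.1\n"
machine_b = "numpy==1.14.0\npytest==3.3.1\ncython==0.27.3\n"
drift = diff_environments(machine_a, machine_b)
```

Any package that appears in `drift` is a candidate for pinning and retesting. A difference in versions is only a lead, not proof that the package causes the flakiness.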

References

An additional instance, astropy__astropy-6938, was flaky on some machines and not others. The authors of SWE-bench were able to reproduce the flakiness; however, we were unable to. Our preliminary investigation indicates this specific issue is due to unpinned versions of dependencies in the docker environments that run the unit tests.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling (2407.21787 - Brown et al., 31 Jul 2024) in Appendix: SWE-bench Lite, Subsection “Test Suite Flakiness”