Cause of test flakiness for SWE-bench Lite instance astropy__astropy-6938
Ascertain the exact cause of the non-deterministic behavior of the unit test suite for the SWE-bench Lite instance astropy__astropy-6938, which is flaky on some machines but not others, and determine the environment and dependency conditions under which the flakiness occurs.
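One possible starting point for the environment question is to snapshot the installed package versions on a machine where the test passes and on one where it fails, then compare the two. The sketch below is only an illustration of that idea, not a method from the source: the helper names `snapshot_versions` and `diff_versions` are hypothetical, and the package names and version strings in the example are placeholders rather than measured data.

```python
from importlib.metadata import distributions


def snapshot_versions():
    """Return {package name: version} for the current Python environment."""
    versions = {}
    for dist in distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            versions[name.lower()] = dist.version
    return versions


def diff_versions(passing, failing):
    """Return {name: (version_in_passing, version_in_failing)} where they differ.

    A package missing from one environment is reported as None on that side.
    """
    drift = {}
    for name in sorted(set(passing) | set(failing)):
        a, b = passing.get(name), failing.get(name)
        if a != b:
            drift[name] = (a, b)
    return drift


# Placeholder snapshots (hypothetical names and versions, not real measurements).
passing_env = {"pkg_a": "1.0.0", "pkg_b": "2.3.1"}
failing_env = {"pkg_a": "1.1.0", "pkg_b": "2.3.1", "pkg_c": "0.9.0"}
drift = diff_versions(passing_env, failing_env)
```

In practice, `snapshot_versions()` would be run inside each container and its output saved, so that the two snapshots can be diffed offline. Any package that appears in `drift` is a candidate for pinning and retesting; a version difference alone does not establish that the package causes the flakiness.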
References
An additional instance, astropy__astropy-6938, was flaky on some machines and not others. The authors of SWE-bench were able to reproduce the flakiness; however, we were unable to. Our preliminary investigation indicates this specific issue is due to unpinned versions of dependencies in the docker environments that run the unit tests.
— Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
(2407.21787 - Brown et al., 31 Jul 2024) in Appendix: SWE-bench Lite, Subsection “Test Suite Flakiness”