Non-Replicable ML Research Challenges

Updated 12 December 2025
  • Non-replicable ML research is defined by the inability to independently confirm results due to uncontrolled randomness, poor documentation, and varying experimental conditions.
  • Methodological issues such as algorithmic stochasticity, hardware/software variations, and design pitfalls contribute significantly to irreproducible outcomes.
  • These challenges distort scientific records, erode public trust, and call for robust frameworks and best practices to improve reproducibility in ML studies.

Non-replicable ML research refers to studies where reported results cannot be independently confirmed under standardized or reasonably varied experimental conditions. The phenomenon spans computational, methodological, organizational, and epistemological dimensions. Despite focused initiatives across the scientific community, large fractions of ML studies—especially those deploying deep learning—remain irreproducible or irrecoverable, a pattern with significant implications for scientific progress, real-world deployment, and public trust.

1. Types and Formal Definitions of Replicability and Reproducibility

Contemporary ML distinguishes multiple grades of scientific validation, most precisely systematized in frameworks by Desai et al., Gundersen & Kjensmo, and Belz:

  • Repeatability: An experiment rerun by the original team, with identical code, data, and environment, yielding the same results.
  • Reproducibility (Dependent/Independent): A new team recovers the same results, either by rerunning the original implementation and data (dependent), or via re-implementation from the publication’s description (independent).
  • Replicability (Direct/Conceptual): New experiments vary aspects of implementation (direct) or design/protocol (conceptual) while targeting the original hypothesis. “Direct replicability” checks robustness to implementation variants; “conceptual replicability” examines the core claim’s generality rather than the exact outcome (Desai et al., 29 Apr 2024).

Additional formalizations are used in subfields. For example, in deep learning for software engineering, replicability at level $\alpha$ is defined so that, over repeated runs $X_1, \ldots, X_n$, the reported score $X_0$ falls within a $(1-\alpha)$ confidence interval around the estimated mean $\hat{\mu}$, corrected for run variance. Here, the coefficient of variation $c_v = \hat{\sigma} / \hat{\mu}$ quantifies result stability: a low $c_v$ indicates high replicability. Reproducibility, in contrast, is operationalized through sensitivity to data variations (e.g., test split, vocabulary, convergence) and uses metrics such as $c_v^{\text{test}}$ and $c_v^{\text{vocab}}$ (Liu et al., 2020).
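
To make this definition concrete, the following Python sketch checks whether a reported score lies inside the $(1-\alpha)$ confidence interval of repeated runs and computes $c_v$. The function name, the Student-t interval, and the example scores are illustrative assumptions, not the exact formulation of Liu et al. (2020).

```python
import numpy as np
from scipy import stats

def replicable_at_level_alpha(reported_score, rerun_scores, alpha=0.05):
    """Check whether a reported score falls inside the (1 - alpha) confidence
    interval around the mean of repeated runs, and return the coefficient of
    variation (a sketch of the definition discussed above)."""
    x = np.asarray(rerun_scores, dtype=float)
    n = len(x)
    mean, sd = x.mean(), x.std(ddof=1)
    # Student-t confidence interval for the mean over n reruns.
    half_width = stats.t.ppf(1 - alpha / 2, df=n - 1) * sd / np.sqrt(n)
    ci = (mean - half_width, mean + half_width)
    cv = sd / mean  # low c_v indicates stable, highly replicable results
    return ci[0] <= reported_score <= ci[1], ci, cv

# Hypothetical example: a paper reports 0.87 accuracy; sixteen reruns vary slightly.
reruns = [0.85, 0.86, 0.84, 0.88, 0.86, 0.87, 0.85, 0.86,
          0.84, 0.87, 0.86, 0.85, 0.88, 0.86, 0.85, 0.87]
ok, ci, cv = replicable_at_level_alpha(0.87, reruns)
print(f"within CI: {ok}, CI: ({ci[0]:.4f}, {ci[1]:.4f}), c_v: {cv:.4f}")
```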

This taxonomy is rooted in broader metrological concepts: “repeatability” (identical conditions) and “reproducibility” (at least one differing condition). Belz advocates the unbiased coefficient of variation $CV^*$ as a unitless, field-comparable measure of reproducibility (Belz, 2021).
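
A short sketch of how $CV^*$ can be computed, assuming the standard small-sample correction factor $(1 + \tfrac{1}{4n})$; the exact formulation in Belz (2021) may differ in detail.

```python
import numpy as np

def unbiased_cv(scores):
    """Coefficient of variation with the standard small-sample correction
    (assumed here to approximate the CV* advocated by Belz, 2021)."""
    x = np.asarray(scores, dtype=float)
    n = len(x)
    cv = x.std(ddof=1) / x.mean()
    return (1 + 1 / (4 * n)) * cv  # unitless, comparable across studies

print(unbiased_cv([0.85, 0.86, 0.84, 0.88, 0.86]))  # small CV* -> stable results
```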

2. Methodological and Technical Factors Driving Non-Replicability

Non-replicable ML research arises from a confluence of computational, methodological, and organizational breakdowns. A comprehensive inventory includes:

  • Algorithmic Stochasticity: Randomness in initialization, data shuffling, and stochastic layers (dropout, batch normalization) leads to run-to-run variability (Rivera-Landos et al., 2021, Liu et al., 2020); a seed- and determinism-control sketch follows this list.
  • Uncontrolled Software/Hardware Variants: Framework/library upgrades (e.g., PyTorch, TensorFlow, CUDA), differences in hardware (CPU vs GPU, FPU microcode), compiler settings, environmental variables, and multi-threading all introduce nondeterminism (Rivera-Landos et al., 2021, Semmelrock et al., 2023, Gundersen et al., 2022).
  • Poor Documentation and Artifact Availability: Missing code/domains/data, unclear experimental protocols, and insufficiently specified hyperparameters prevent reproduction (Akella et al., 2023, Semmelrock et al., 2023).
  • Experiment Design Pitfalls: Data leakage through improper train/test splits, metric/hyperparameter selection bias, non-standardized preprocessing, and selective reporting constitute significant threats (Gundersen et al., 2022, Semmelrock et al., 2023).
  • Structural and Social Barriers: Privacy policies, intellectual property, lack of incentives for sharing, and the academic race-to-publish create a research environment poorly aligned with reproducibility (Semmelrock et al., 2023, Kou, 19 Apr 2024).
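
The computational sources in this list can be partially controlled at the code level. The sketch below pins the usual random seeds and requests deterministic kernels, assuming a PyTorch stack; full determinism additionally depends on library versions, hardware, and operator support, as the cited studies emphasize.

```python
import os
import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    """Pin the principal sources of run-to-run randomness in a typical
    PyTorch setup (a sketch; not a guarantee of bitwise determinism)."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Request deterministic cuDNN/cuBLAS kernels; this can slow training and
    # raises warnings for ops without a deterministic implementation.
    # CUBLAS_WORKSPACE_CONFIG should ideally be set before any CUDA work.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True, warn_only=True)

seed_everything(42)
```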

A structured framework groups these factors across the stages of the scientific method: data collection (dataset bias, label noise), preprocessing (underspecification), modeling (random seed dependence), tuning (overfitting), implementation (version drift), analysis (unstable metrics, p-hacking), and reporting (incomplete documentation) (Gundersen et al., 2022). Rivera-Landos et al. introduced the “NDIF” (Non-Determinism Introducing Factors) schema to encompass all principal sources of run-to-run divergence (Rivera-Landos et al., 2021).

3. Quantitative Evidence and Impact of Non-Replicability

Several empirical studies document the prevalence and magnitude of irreproducibility:

  • In a review of 93 papers on deep learning in software engineering, only 29% included replication packages, and of those, just 25.8% remained accessible; only 10.8% addressed replicability or reproducibility (Liu et al., 2020).
  • Hutson finds that only ~33% of ML papers share their data, and far fewer share code (Semmelrock et al., 2023).
  • Sampled studies display wide run-to-run result spreads: LeNet5 trained with 16 different seeds yields accuracies ranging from 8.6% to 99% (Desai et al., 29 Apr 2024, Gundersen et al., 2022); DeepCS (a code search model for software engineering) exhibits $c_v \approx 6\%$ between runs, with top-reported scores overstated by up to 12% (Liu et al., 2020).
  • In NLP and ML, independent reproduction success rates range from roughly 32% to 64% depending on author support and artifact access; reproduced results often underperform the original reports (Desai et al., 29 Apr 2024, Semmelrock et al., 2023).
  • Non-replicable bandit algorithms exhibit high estimator variance and inconsistent inference even as sample size increases, a critical finding for adaptive ML methodologies in digital health and RL domains (Zhang et al., 22 Jul 2024).

A plausible implication is that widespread non-replicability distorts the scientific record, privileges “lucky” outcomes, and biases claims of state-of-the-art advances.

4. Frameworks and Formal Models Addressing Non-Replicable Research

Numerous formal and semi-formal frameworks are designed to diagnose and mitigate non-replicability:

  • Multistage Validation (Desai et al., 29 Apr 2024): Map the research pipeline from claim to conclusion, specifying at each stage what is fixed or varied in validation studies. This enables differential diagnosis of where non-replicability arises (code vs environment vs protocol vs conceptual design).
  • Metrology-Inspired Reproducibility Scoring (Belz, 2021): Utilize the sample standard deviation, coefficient of variation, and confidence intervals over repeated measurements as unbiased, comparable metrics of result stability.
  • Effort of Reproducibility (Akella et al., 2023): Model the human and technical cost of reproducing results as an additive function of easiness/difficulty factors; such scores, when instrumented, can serve both as diagnostics for unreliable papers and as guides for improving artifact documentation (a hypothetical scoring sketch follows this list).
  • Replicable Bandit Algorithms (Zhang et al., 22 Jul 2024): Formally define algorithmic replicability in adaptive experiments, proving that unless the action-selection policies themselves concentrate to deterministic limits, post-hoc inference is fundamentally non-replicable.
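
As an illustration of the effort-of-reproducibility idea, the following sketch scores a paper by summing weighted easiness/difficulty indicators. The factor names and weights are hypothetical and are not those instrumented by Akella et al. (2023).

```python
# Hypothetical additive effort-of-reproducibility score.
FACTOR_WEIGHTS = {
    "code_released": -2.0,              # easiness factors lower the expected effort
    "data_released": -2.0,
    "environment_specified": -1.0,
    "hyperparameters_reported": -1.0,
    "requires_proprietary_data": 3.0,   # difficulty factors raise it
    "requires_specialized_hardware": 2.0,
}

def effort_score(paper_attributes: dict) -> float:
    """Sum weighted easiness/difficulty indicators into a single effort estimate."""
    return sum(w for name, w in FACTOR_WEIGHTS.items()
               if paper_attributes.get(name, False))

print(effort_score({"code_released": True, "requires_proprietary_data": True}))  # 1.0
```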

Additional recommendations advocate for comprehensive control and disclosure of randomness, versioning, and environment, as well as the adoption of uncertainty quantification throughout the experimentation pipeline (Liu et al., 2020, Gundersen et al., 2022).
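
A lightweight way to follow these recommendations is to store a machine-readable snapshot of the run context next to the results. The sketch below records the Python version, platform, selected package versions, the git commit, and the seed; the field names are assumptions, and real pipelines would also capture data hashes and full configurations.

```python
import json
import platform
import subprocess
import sys
from importlib import metadata

def environment_snapshot(seed: int, packages=("numpy", "torch")) -> dict:
    """Record the versioned context of a run so reruns can diagnose drift."""
    snap = {
        "seed": seed,
        "python": sys.version,
        "platform": platform.platform(),
        "packages": {},
    }
    for pkg in packages:
        try:
            snap["packages"][pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            snap["packages"][pkg] = None
    try:  # git commit of the experiment code, if run inside a repository
        snap["git_commit"] = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True,
            stderr=subprocess.DEVNULL).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        snap["git_commit"] = None
    return snap

print(json.dumps(environment_snapshot(seed=42), indent=2))
```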

5. Broader Epistemic and Ethical Context

Non-replicable ML research is not solely a technical failure but signals deeper epistemic and ethical crises:

  • Pseudo-Confirmatory Practice: Much empirical ML, especially benchmark-centric method development, is framed as confirmatory (i.e., hypothesis-testing) but in reality is exploratory, undermining the validity of statistical inference and generalization (Herrmann et al., 3 May 2024).
  • Responsibility Gaps: Limiting replicability to model performance replicability (MPR) allows socially consequential claims (e.g., fairness improvements) to escape critical scrutiny, transferring interpretative risk to downstream users. A shift to claim replicability (CR), which requires that each research claim, not just the performance metric, be independently supported, establishes actionable accountability (Kou, 19 Apr 2024).
  • Incentive Misalignment: Structural disincentives for artifact sharing and careful reporting are reinforced by academic and industry reward systems, which value novelty and headline results over methodological rigor or negative findings (Semmelrock et al., 2023, Akella et al., 2023).

Addressing these pathologies demands a cultural shift: reconciling exploratory and confirmatory modes, supporting dedicated reproducibility venues, and re-aligning professional norms toward transparency, provenance, and evidential diversity (Herrmann et al., 3 May 2024, Kou, 19 Apr 2024).

6. Best Practices, Tools, and Community Recommendations

Successful reduction of non-replicability in ML requires a multi-pronged approach:

  • Artifact Availability: Host code/data with version control (Git), archive on persistent repositories (e.g., Zenodo, Figshare), and provide containerized environments (Docker, Singularity) (Semmelrock et al., 2023, Liu et al., 2020).
  • Experiment Specification: Document all hyperparameters, random seeds, hardware/software stack, and data splits; provide explicit pseudocode and tables of design choices (Gundersen et al., 2022).
  • Robust Evaluation: Report result distributions (mean, standard deviation, $c_v$) over ≥10 seeds; conduct and disclose statistical significance testing; avoid cherry-picking favorable splits or metrics (Belz, 2021, Liu et al., 2020); see the aggregation sketch after this list.
  • Reproducibility Checklists: Adopt community checklists such as the NeurIPS checklist, ACM Artifact Review and Badging, or Pineau’s ML Reproducibility Checklist, and make compliance visible (Semmelrock et al., 2023, Akella et al., 2023).
  • Uncertainty Quantification: Use variance, confidence intervals, and reporting of negative results to contextualize findings (Liu et al., 2020, Herrmann et al., 3 May 2024).
  • Community Initiatives: Engage with reproducibility challenges, artifact badging programs, and replication tracks; support open registries and replication-focused journals (Semmelrock et al., 2023, Akella et al., 2023).
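
A minimal sketch of the multi-seed reporting practice above: aggregate per-seed scores into mean, standard deviation, and $c_v$, and test the difference against a baseline instead of reporting only the best run. The per-seed accuracies are hypothetical numbers for illustration.

```python
import numpy as np
from scipy import stats

def summarize_runs(scores):
    """Aggregate per-seed scores into the distributional summary recommended above."""
    x = np.asarray(scores, dtype=float)
    return {
        "n_seeds": len(x),
        "mean": x.mean(),
        "std": x.std(ddof=1),
        "cv": x.std(ddof=1) / x.mean(),
        "min": x.min(),
        "max": x.max(),
    }

# Hypothetical per-seed accuracies for a proposed model and a baseline (10 seeds each).
proposed = [0.871, 0.866, 0.874, 0.861, 0.869, 0.872, 0.858, 0.867, 0.870, 0.864]
baseline = [0.859, 0.861, 0.855, 0.864, 0.858, 0.860, 0.852, 0.863, 0.857, 0.861]

print(summarize_runs(proposed))
# Non-parametric test of the difference; report the p-value, not only the best seed.
print(stats.mannwhitneyu(proposed, baseline, alternative="two-sided"))
```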

A plausible implication is that routine adoption of these practices will increase the baseline effort for non-replicable studies to enter the literature, shifting ML toward a more stable, trustworthy domain.


Summary Table: Validation Categories (after Desai et al., 29 Apr 2024)

Category | Who Runs It | What Is Fixed | What Is Varied | Outcome Criterion
Repeatability | Original team | All (H, D, I) | None | Identical outcomes
Dependent Reproducibility | New team | H, D, I₀ | Team | Identical outcomes
Independent Reproducibility | New team | H, D | Implementation | Same conclusions
Direct Replicability | New team | H, D | Implementation | Robust conclusions
Conceptual Replicability | New team | H | Design | Generalized conclusions

(H = hypothesis, D = data, I = implementation; I₀ denotes the original implementation.)

The non-replicable ML research crisis arises from systematic computational, methodological, and epistemic failures. It is compounded by weak artifact sharing and entrenched incentive misalignment. Only by adopting explicit definitions, quantitative metrics (such as $c_v$ and $CV^*$), principled experiment and reporting standards, and a shift toward claim-focused accountability can the community restore scientific trustworthiness and real-world impact (Liu et al., 2020, Semmelrock et al., 2023, Belz, 2021, Herrmann et al., 3 May 2024, Kou, 19 Apr 2024, Akella et al., 2023, Gundersen et al., 2022, Desai et al., 29 Apr 2024, Zhang et al., 22 Jul 2024, Rivera-Landos et al., 2021).
