Reproducibility of DeepCoder’s reported 60.6% LiveCodeBench Pass@1 accuracy

Determine whether the reported 60.6% Pass@1 accuracy of the DeepCoder 14B model on LiveCodeBench can be reproduced under the same sampling parameters and software environment described in the DeepCoder evaluation.

Background

In comparing OAPL against the GRPO-trained DeepCoder model on LiveCodeBench, the authors report that the DeepCoder accuracy they measured falls below the originally reported 60.6% Pass@1. Despite efforts to match the sampling parameters and software environment, they could not replicate the result, and they note that others have encountered similar difficulties.

The originally reported LiveCodeBench performance for DeepCoder therefore remains unverified, which motivates a concrete verification task: replicate the claim under the stated evaluation configuration.
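To make the comparison concrete, the sketch below shows one way a replication attempt might score its own runs against the 60.6% figure. It uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), which reduces to c/n for k = 1. This is an illustration only: the source does not state how many samples per problem were drawn or which scoring script was used, so the per-problem counts and the names pass_at_k, benchmark_pass_at_1, and hypothetical_results are assumptions.

```python
from math import comb
from statistics import mean


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples drawn, c: samples that passed all tests, k: budget.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(results: list[tuple[int, int]]) -> float:
    """Mean pass@1 over problems; results holds (n_samples, n_correct)."""
    return mean(pass_at_k(n, c, 1) for n, c in results)


# Reported DeepCoder figure from the source (60.6% Pass@1).
REPORTED = 0.606

# Hypothetical per-problem counts, NOT real evaluation data.
hypothetical_results = [(8, 8), (8, 4), (8, 0), (8, 6)]

measured = benchmark_pass_at_1(hypothetical_results)
gap = REPORTED - measured
print(f"measured pass@1 = {measured:.4f}, gap to reported = {gap:+.4f}")
```

A real replication would replace hypothetical_results with counts from the full benchmark, and would record the sampling parameters and software versions alongside the score so that any gap can be attributed.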

References

We were unable to replicate the 60.6% result despite our best efforts to match their sampling parameters and software environment.

LLMs Can Learn to Reason Via Off-Policy RL (2602.19362 - Ritter et al., 22 Feb 2026) in Section: Results on Code Generation, Pass@k performance paragraph (footnote)