Ascertain the genuineness of the reported 96.8% Verina performance under specification leakage
Determine to what extent the 96.8% performance that Harmonic reported on the Verina benchmark reflects genuine advances in code synthesis, given that the evaluation does not control for specification leakage and the underlying results cannot be examined.
References
Harmonic has reported 96.8% performance on Verina. Without controlling for specification leakage and without being able to examine the results, it remains unclear to what extent such results reflect genuine advances in code synthesis.
— WybeCoder: Verified Imperative Code Generation
(2603.29088 - Gloeckle et al., 31 Mar 2026) in Section 3, Specification Leakage (footnote)