Ascertain genuineness of reported 96.8% Verina performance under specification leakage

Determine to what extent the reported 96.8% performance on the Verina benchmark reflects genuine advances in code synthesis, given that the evaluation does not control for specification leakage and the underlying results are not publicly examinable.

Background

The WybeCoder paper (Gloeckle et al., 2026) discusses "specification leakage" in the Verina benchmark: a specification may itself encode a computable function, which permits trivial implementations and proofs and thereby inflates measured performance. This makes it hard to read benchmark results as genuine progress in verified code generation.
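
A minimal Lean 4 sketch of how leakage can arise, assuming a toy list-sum task. The names `sumSpec` and `sumImpl` are hypothetical and are not taken from Verina or the paper; the point is only that when the specification is already an executable function, an implementation can reuse it and the proof closes by reflexivity.

```lean
-- Hypothetical example, not taken from Verina or the paper.
-- The specification is itself a computable function on lists.
def sumSpec : List Nat → Nat
  | [] => 0
  | x :: xs => x + sumSpec xs

-- A "leaky" implementation simply calls the specification.
def sumImpl (xs : List Nat) : Nat := sumSpec xs

-- The correctness proof holds by unfolding the definition.
theorem sumImpl_correct (xs : List Nat) : sumImpl xs = sumSpec xs := rfl
```

A solution of this shape would count as verified without any nontrivial synthesis or proof effort, which is why a high score alone is hard to interpret.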

Harmonic has publicly reported 96.8% performance on Verina. Because that evaluation does not control for specification leakage and its detailed outputs are not available for inspection, the authors note that it is unclear whether the score reflects genuine advances in code synthesis or exploitation of leakage.

As a mitigation, the authors propose requiring implementations to be imperative while still allowing functional specifications. How to interpret the reported 96.8% nevertheless remains an open question, pending a leakage-controlled evaluation and transparent inspection of the results.
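
The mitigation can be illustrated with a similar Lean 4 sketch, under assumed names (`listSumSpec`, `listSumLoop`) and using Lean's `do` notation as a stand-in for an imperative language; the paper's actual setup may differ. The specification stays functional, while the implementation must be a loop over mutable state, so it can no longer be the specification itself.

```lean
-- Hypothetical example; names and task are illustrative only.
-- Functional specification: allowed to be a computable function.
def listSumSpec : List Nat → Nat
  | [] => 0
  | x :: xs => x + listSumSpec xs

-- Imperative-style implementation: a loop with a mutable accumulator.
def listSumLoop (xs : List Nat) : Nat := Id.run do
  let mut acc : Nat := 0
  for x in xs do
    acc := acc + x
  return acc

#eval listSumLoop [1, 2, 3]  -- 6
```

Showing `listSumLoop xs = listSumSpec xs` for all `xs` typically requires reasoning about the loop, such as an invariant on `acc`, instead of following directly from the definitions.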

References

Harmonic has reported 96.8% performance on Verina. Without controlling for specification leakage and without being able to examine the results, it remains unclear to what extent such results reflect genuine advances in code synthesis.

WybeCoder: Verified Imperative Code Generation (2603.29088 - Gloeckle et al., 31 Mar 2026) in Section 3, Specification Leakage (footnote)