- The paper introduces a novel approach that replaces internal simulation with concrete execution feedback to bridge the mental-reality gap in LLM code generation.
- It employs a S.O.L.I.D. architecture—including live execution, oracle-based assertions, and defensive accumulation—to overcome specification and verification gaps, achieving state-of-the-art performance on benchmarks.
- The results demonstrate significant improvements with metrics such as 95.7% pass@1 on HumanEval, highlighting its practical impact on robust, real-world code synthesis and regression prevention.
SolidCoder: Concrete Execution for Bridging the Mental-Reality Gap in LLM Code Generation
Introduction and Motivation
Despite significant advances in LLM-based code generation, recent evaluations reveal a persistent reliability problem: models frequently validate incorrect or buggy code because they hallucinate and over-rely on internal simulation. The core issue, termed the "Mental-Reality Gap," arises when LLMs mentally simulate code execution for planning and debugging, producing imagined traces that confirm their own faulty implementations. The gap manifests along two failure axes: the Specification Gap (incomplete reasoning about edge cases) and the Verification Gap (validation based on hallucinated traces).
MapCoder and subsequent agent-based pipelines—exemplified by CodeSIM—extended LLM systems with explicit planning, analogy, and iterative debugging, but ultimately grounded both specification and validation in the same fragile text-based simulation loop. Empirical evidence demonstrates that these frameworks hallucinate correctness for flawed code, which directly limits generalization to competitive and real-world software engineering benchmarks.
The S.O.L.I.D. Architecture
SolidCoder presents a structured remedy to the Mental-Reality Gap: replacing internal simulation with concrete execution feedback within an integrated multi-agent pipeline. The system's design principle is simple: don't imagine, execute. It is operationalized by the S.O.L.I.D. methodology, which addresses both major axes of the gap. The components named in this summary are Shift-left Planning, Oracle-based Assertions, Live Execution, and Defensive Accumulation.
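The "don't imagine, execute" principle can be sketched as a small helper that runs a candidate together with its assertions in a fresh interpreter and returns the real error output as feedback. This is a minimal illustration, not the paper's implementation: the name `run_candidate`, its parameters, and the toy `add` candidate are all assumptions, and a plain subprocess is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_candidate(code: str, test_snippet: str, timeout_s: float = 5.0) -> tuple[bool, str]:
    """Run candidate code plus assertions in a separate Python process.

    Returns (passed, feedback). The feedback is the actual stderr text,
    so a failure report comes from the interpreter, not from a model's
    imagined trace. Hypothetical helper for illustration only.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test_snippet + "\n", encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False, f"timeout after {timeout_s}s"
    return proc.returncode == 0, proc.stderr.strip()


# Toy candidate with a deliberate bug (subtracts instead of adding).
buggy = "def add(a, b):\n    return a - b\n"
fixed = "def add(a, b):\n    return a + b\n"

passed_buggy, feedback_buggy = run_candidate(buggy, "assert add(2, 3) == 5")
passed_fixed, feedback_fixed = run_candidate(fixed, "assert add(2, 3) == 5")
```

In a pipeline of this kind, `feedback_buggy` (a real traceback ending in `AssertionError`) would be passed back to the debugging step in place of a simulated trace.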
Experimental Results and Analysis
SolidCoder attains state-of-the-art pass@1 performance across standard benchmarks. With GPT-4o, it reports 95.7% on HumanEval (+0.6% over CodeSIM), 77.0% on CodeContests (+4.3%), and 26.7% on APPS (+3.4%). Similar gains are reported on RL post-trained models, suggesting robustness to both architecture and training regime. The largest margin appears on the medium-difficulty tasks of CodeContests, where simulation is least reliable and execution-based falsification is both computationally tractable and impactful.
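For readers unfamiliar with the metric, pass@1 is the fraction of problems whose generated solution passes all hidden tests. The sketch below shows the standard unbiased pass@k estimator, which reduces to c/n for k = 1. The summary does not say how many samples per problem were drawn, so the function names and the 157-of-164 example are illustrative arithmetic (HumanEval has 164 problems), not figures taken from the source.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed, k: budget considered.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_1(solved: list[bool]) -> float:
    """pass@1 with one sample per problem: the fraction of problems solved."""
    return sum(solved) / len(solved)


# Illustrative only: 157 solved out of 164 problems rounds to 95.7%.
example = [True] * 157 + [False] * 7
example_percent = round(100 * benchmark_pass_at_1(example), 1)
```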
Ablation studies demonstrate strong component effects:
- Removing Shift-left Planning causes a 23.7% drop on CodeContests, the largest drop reported here, highlighting persistent LLM blindness to boundary cases.
- Removing Live Execution reduces performance by 7.9%, underscoring that hallucinated validation remains a nontrivial bottleneck.
- Removing Oracle-based Assertions and Defensive Accumulation causes drops of 11.6% and 6.7%, respectively, showing that property-based and regression-proofing checks are non-redundant.
The paper's core claim is that grounding both specification and verification in execution is necessary for robust and general code synthesis. Gains from the two axes are largely additive, and neither is sufficient alone.
The paper illustrates this qualitatively with a direct comparison of CodeSIM and SolidCoder on a list rotation bug: mental simulation in CodeSIM yields a spurious PASS, while SolidCoder's live execution triggers a concrete assertion failure that immediately surfaces the bug.
Figure 2: The Mental-Reality Gap on a rotation problem—only empirical execution (right) surfaces latent errors that simulated reasoning (left) overlooks.
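The kind of bug in this comparison can be reproduced with a few lines. The summary does not give the paper's actual rotation code, so the functions below are a hypothetical stand-in: a right rotation that forgets to reduce `k` modulo the list length looks plausible when traced by hand on small cases, but an executed assertion exposes it.

```python
def rotate_right_buggy(items: list, k: int) -> list:
    """Hypothetical buggy rotation: wrong whenever k exceeds len(items)."""
    return items[-k:] + items[:-k]


def rotate_right(items: list, k: int) -> list:
    """Rotation with k reduced modulo the length, and empty input handled."""
    if not items:
        return []
    split = len(items) - (k % len(items))
    return items[split:] + items[:split]


def check(rotate) -> str:
    """Execute concrete assertions and report the real outcome."""
    try:
        assert rotate([1, 2, 3], 1) == [3, 1, 2], "basic rotation failed"
        assert rotate([1, 2, 3], 4) == [3, 1, 2], "k > len(items) not handled"
    except AssertionError as exc:
        return f"FAIL: {exc}"
    return "PASS"


buggy_verdict = check(rotate_right_buggy)
fixed_verdict = check(rotate_right)
```

Here the buggy version passes the basic case and fails only on `k = 4`, which is the sort of boundary input that a simulated trace can gloss over.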
Practical and Theoretical Implications
Adopting concrete execution, coupled with property-based oracles, substantially mitigates the hallucinated validation that affects prior agentic pipelines. Shifting self-evaluation from simulated traces to executed code has two broader implications for scalable code synthesis frameworks:
- Inference-time safety and regression preservation: Defensive Accumulation ensures old bugs do not resurface post-fix, an essential property for deployment in continuous integration and automated patching pipelines.
- Verification without ground-truth outputs: SolidCoder's property-based oracles decouple code validation from access to expected outputs, extending applicability to settings, such as novel competitive problems, where only input constraints and invariants are available.
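The two points above can be sketched together: a property check that needs no expected outputs, plus a regression suite that keeps every input that ever exposed a failure and reruns it against each later candidate. This is a minimal sketch under stated assumptions. The sorting task, `violated_property`, and `RegressionSuite` are hypothetical names chosen for illustration, not the paper's API.

```python
from collections import Counter
from typing import Callable, Optional

Candidate = Callable[[list], list]


def violated_property(xs: list, ys: list) -> Optional[str]:
    """Check invariants of a sort without knowing the expected output."""
    if any(a > b for a, b in zip(ys, ys[1:])):
        return "output is not non-decreasing"
    if Counter(xs) != Counter(ys):
        return "output is not a permutation of the input"
    return None


class RegressionSuite:
    """Accumulates inputs that exposed a failure (hypothetical helper)."""

    def __init__(self) -> None:
        self.inputs: list = []

    def probe(self, candidate: Candidate, xs: list) -> Optional[str]:
        """Run one input; keep it in the suite if a property is violated."""
        problem = violated_property(xs, candidate(list(xs)))
        if problem is not None and xs not in self.inputs:
            self.inputs.append(list(xs))
        return problem

    def failures(self, candidate: Candidate) -> list:
        """Rerun every accumulated input against a new candidate."""
        results = []
        for xs in self.inputs:
            problem = violated_property(xs, candidate(list(xs)))
            if problem is not None:
                results.append((xs, problem))
        return results


suite = RegressionSuite()

# First attempt silently drops duplicates; the property check catches it.
first_attempt = lambda xs: sorted(set(xs))
first_problem = suite.probe(first_attempt, [3, 1, 3])

# A later fix must pass every input that has failed before.
remaining = suite.failures(sorted)
```

Because the suite only grows, a later edit that reintroduces the duplicate-dropping behaviour would be flagged by `failures` before it could be accepted.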
As LLMs are increasingly augmented with tool use and programmatic APIs, concrete execution is likely to become a default primitive for robust software agents. SolidCoder's results suggest that the next bottleneck is not raw code generation but rigorous, interpretable, and compositional self-verification. The design pattern may extend to repository-level, multi-language, and real-world development workflows, although these settings are not evaluated.
Limitations and Future Directions
SolidCoder's evaluation is limited to function-level, Python-centric problems, but the approach is extensible in principle. Language sandboxing, broader property-oracle synthesis, and efficiency-aware routing are the next frontiers, especially since token and call overhead remains non-negligible on easy or already-solved tasks.
Further, under heterogeneous model compositions and cross-agent communication (e.g., different LLMs for code and test generation), propagation of systemic biases in oracle generation must be studied.
Conclusion
SolidCoder offers a methodical approach to closing the Mental-Reality Gap in LLM code generation by grounding every critical planning and validation stage in concrete execution. The reported results show consistent improvements over strong simulation-based pipelines and highlight the need to treat specification and verification as separate axes for reliable, generalizable autonomous programming. Future research must generalize these principles to broader codebases, richer oracles, and more efficient conditional inference.