SolidCoder: Bridging the Mental-Reality Gap in LLM Code Generation through Concrete Execution

Published 20 Apr 2026 in cs.SE and cs.AI | (2604.19825v1)

Abstract: State-of-the-art code generation frameworks rely on mental simulation, where LLMs internally trace execution to verify correctness. We expose a fundamental limitation: the Mental-Reality Gap -- where models hallucinate execution traces and confidently validate buggy code. This gap manifests along two orthogonal dimensions: the Specification Gap (overlooking edge cases during planning) and the Verification Gap (hallucinating correct behavior for flawed code). We propose SolidCoder with a simple principle: don't imagine -- execute. The S.O.L.I.D. architecture addresses both dimensions by forcing edge-case awareness before algorithm design and replacing imagined traces with sandboxed execution using property-based oracles. With GPT-4o, SolidCoder achieves state-of-the-art pass@1 performance: 95.7% on HumanEval (+0.6%p), 77.0% on CodeContests (+4.3%p), and 26.7% on APPS (+3.4%p). Ablation reveals that edge-case awareness provides the largest individual gain, while execution grounding catches categorically different errors that specification improvements cannot address. These gains generalize to RL post-trained models, validating that bridging both gap dimensions is essential for robust code synthesis. We release our code and framework to facilitate future research.


Authors (2)

Summary

The paper introduces a novel approach that replaces internal simulation with concrete execution feedback to bridge the mental-reality gap in LLM code generation.
It employs a S.O.L.I.D. architecture—including live execution, oracle-based assertions, and defensive accumulation—to overcome specification and verification gaps, achieving state-of-the-art performance on benchmarks.
With GPT-4o, it reports 95.7% pass@1 on HumanEval, 77.0% on CodeContests, and 26.7% on APPS, with the largest gain (+4.3 percentage points) on CodeContests; Defensive Accumulation additionally guards against regressions during refinement.

SolidCoder: Concrete Execution for Bridging the Mental-Reality Gap in LLM Code Generation

Introduction and Motivation

Despite significant advances in LLM-based code generation, recent evaluation reveals a persistent reliability challenge: models frequently validate incorrect or buggy code due to hallucination and overreliance on internal simulation. The core issue, termed the "Mental-Reality Gap," arises when LLMs mentally simulate code execution for planning and debugging, often producing imagined traces and self-affirming faulty implementations. This manifests in two key failure axes: the Specification Gap (incomplete reasoning about edge cases) and the Verification Gap (hallucinated trace-based validation).

MapCoder and subsequent agent-based pipelines—exemplified by CodeSIM—extended LLM systems with explicit planning, analogy, and iterative debugging, but ultimately grounded both specification and validation in the same fragile text-based simulation loop. Empirical evidence demonstrates that these frameworks hallucinate correctness for flawed code, which directly limits generalization to competitive and real-world software engineering benchmarks.

The S.O.L.I.D. Architecture

SolidCoder presents a structured remedy to the Mental-Reality Gap: replacing internal simulation with concrete execution feedback within an integrated multi-agent pipeline. The system’s design principle is simple—don't imagine, execute—and is operationalized by the S.O.L.I.D. methodology, which addresses both major axes of the gap:

Shift-left Planning (S): Proactively surfaces edge cases before code or plan generation, ensuring that algorithmic designs are forced to address boundary and pathological scenarios upfront. The planning agent is explicitly prompted for worst-case and special pattern inputs.
Oracle-based Assertions (O): Shifts verification from output matching to property-based invariants, enabling self-judgment of correctness using domain-preserving predicates rather than ground-truth I/O—crucial for problems where the "oracle problem" (unknown true outputs) blocks traditional execution-based checking.
Live Execution (L): Systematically replaces imagined traces with actual code execution in a sandboxed environment. Verification verdicts are based on runtime outcomes, not predicted ones.
Intermediate Simulation (I): Retains simulated mental execution as a cost-effective pre-filter but only as an auxiliary layer; the final verdict is grounded in live outcomes.
Defensive Accumulation (D): Accumulates every failing case discovered during the iterative refinement loop and re-checks each revision against all of them, so that a bug fixed in one iteration cannot silently reappear in a later one.
Figure 1: Comparative pipelines—SolidCoder eliminates CodeSIM’s reliance on faulty internal simulation by introducing concrete execution and S.O.L.I.D. safeguards at all critical phases.
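The Oracle-based Assertions and Live Execution steps can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the helper names (`run_candidate`, `sorted_permutation_oracle`, `verify`), the sorting task, and the use of a child Python process as a stand-in for a sandbox are all hypothetical. A real sandbox would also need memory, filesystem, and network isolation.

```python
import json
import subprocess
import sys
from collections import Counter


def run_candidate(source, func_name, arg, timeout=5.0):
    """Execute candidate code in a child Python process (hypothetical helper).

    Returns ("ok", value), ("error", stderr_text), or ("timeout", None).
    The verdict comes from what actually ran, not from a predicted trace.
    """
    driver = (
        source
        + "\nimport json, sys\n"
        + f"print(json.dumps({func_name}(json.loads(sys.argv[1]))))\n"
    )
    try:
        proc = subprocess.run(
            [sys.executable, "-c", driver, json.dumps(arg)],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout", None
    if proc.returncode != 0:
        return "error", proc.stderr.strip()
    # Use the last printed line in case the candidate prints on its own.
    return "ok", json.loads(proc.stdout.strip().splitlines()[-1])


def sorted_permutation_oracle(inp, out):
    """Property check that needs no ground-truth output:
    the result must be ordered and contain exactly the input items."""
    is_sorted = all(a <= b for a, b in zip(out, out[1:]))
    return is_sorted and Counter(inp) == Counter(out)


def verify(source, inputs):
    """Return the inputs whose real outputs crash, time out, or violate the oracle."""
    failures = []
    for x in inputs:
        status, value = run_candidate(source, "solve", x)
        if status != "ok" or not sorted_permutation_oracle(x, value):
            failures.append(x)
    return failures


# Illustrative candidates (assumed, not taken from the paper).
BUGGY = "def solve(xs):\n    return sorted(set(xs))\n"  # silently drops duplicates
FIXED = "def solve(xs):\n    return sorted(xs)\n"
INPUTS = [[3, 1, 2], [2, 2, 1], []]

print(verify(BUGGY, INPUTS))
print(verify(FIXED, INPUTS))
```

The oracle never consults an expected output; it only checks properties that any correct answer must satisfy, which is what lets execution-based checking proceed when true outputs are unknown.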

Experimental Results and Analysis

SolidCoder attains state-of-the-art pass@1 performance across standard benchmarks. With GPT-4o, it reports 95.7% on HumanEval (+0.6 percentage points over CodeSIM), 77.0% on CodeContests (+4.3 points), and 26.7% on APPS (+3.4 points). Similar gains are reported for RL post-trained models, which suggests the benefit does not depend on a particular training regime. The largest margin emerges on medium-difficulty tasks (CodeContests), where simulation unreliability is most acute and execution-based falsification is both computationally tractable and impactful.

Ablation studies demonstrate strong component effects:

Removing Shift-left Planning causes the largest drop, 23.7% on CodeContests, highlighting persistent LLM blindness to boundary cases.
Removing Live Execution reduces performance by 7.9%, underscoring that hallucinated validation remains a nontrivial bottleneck.
Removing Oracle-based Assertions and Defensive Accumulation causes drops of 11.6% and 6.7%, respectively, showing that property-based checks and regression-proofing are non-redundant.

The paper's core claim is that bridging both the specification and verification gaps, with verification grounded in execution, is necessary for robust and general code synthesis. Gains from the two axes are largely additive, and neither is sufficient alone.

A canonical qualitative illustration is provided by direct CodeSIM vs. SolidCoder comparison on a list rotation error: mental simulation in CodeSIM yields a spurious PASS, while SolidCoder’s live execution triggers a concrete assertion failure, immediately surfacing the bug.

Figure 2: The Mental-Reality Gap on a rotation problem—only empirical execution (right) surfaces latent errors that simulated reasoning (left) overlooks.
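The contrast in Figure 2 can be approximated with a small sketch. The paper's actual rotation code is not reproduced here; the bug below (forgetting to reduce `k` modulo the list length) and the names `rotate_right_buggy` and `rotation_property` are hypothetical. A hand trace on a small, typical input looks correct, while running the code against a property check exposes the failure.

```python
def rotate_right_buggy(xs, k):
    # Hypothetical bug: k is never reduced modulo len(xs),
    # so any k larger than the list length is handled incorrectly.
    return xs[-k:] + xs[:-k]


def rotation_property(xs, k, out):
    """Property of a right rotation: the element at index i
    must end up at index (i + k) % n, and the length is unchanged."""
    n = len(xs)
    return len(out) == n and all(out[(i + k) % n] == xs[i] for i in range(n))


# The first case is the kind of input a mental trace tends to pick;
# the others are edge cases (k > len(xs), empty list).
cases = [([1, 2, 3, 4], 1), ([1, 2, 3], 5), ([], 2)]

failures = [
    (xs, k)
    for xs, k in cases
    if not rotation_property(xs, k, rotate_right_buggy(xs, k))
]
print(failures)
```

Only the `k > len(xs)` case fails, which is the sort of input an imagined trace is unlikely to exercise.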

Practical and Theoretical Implications

Adopting concrete execution, coupled with property-based oracles, substantially mitigates the LLM hallucination pathology endemic to prior agentic pipelines. This refactoring of self-evaluation modalities exposes two broader implications for scalable code synthesis frameworks:

Inference-time safety and regression preservation: Defensive Accumulation ensures old bugs do not resurface post-fix, an essential property for deployment in continuous integration and automated patching pipelines.
Oracle-free verification: SolidCoder decouples code validation from ground-truth output access, extending applicability to settings—such as novel competitive problems—where only input constraints and invariants are available.
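A minimal sketch of how Defensive Accumulation could work in a refinement loop is given below. It is an assumption-laden illustration, not the paper's code: the three revisions `v1` to `v3` stand in for successive LLM-produced fixes, the per-round probes stand in for newly generated test inputs, and `refine` and `rotation_ok` are hypothetical names.

```python
def rotation_ok(fn, case):
    """Run fn on a (list, k) case and check the right-rotation property.
    Any exception counts as a failure."""
    xs, k = case
    try:
        out = fn(list(xs), k)
    except Exception:
        return False
    n = len(xs)
    return len(out) == n and all(out[(i + k) % n] == xs[i] for i in range(n))


def v1(xs, k):  # wrong when k > len(xs)
    return xs[-k:] + xs[:-k]


def v2(xs, k):  # fixes large k, but crashes on the empty list
    k %= len(xs)
    return xs[-k:] + xs[:-k]


def v3(xs, k):  # handles both cases
    if not xs:
        return []
    k %= len(xs)
    return xs[-k:] + xs[:-k]


def refine(rounds, check):
    """Each round supplies a revision and fresh probes. Every case that
    ever failed is kept and re-run against all later revisions."""
    failed_so_far = []
    for fn, probes in rounds:
        suite = failed_so_far + [p for p in probes if p not in failed_so_far]
        failing_now = [c for c in suite if not check(fn, c)]
        for c in failing_now:
            if c not in failed_so_far:
                failed_so_far.append(c)
        if not failing_now:
            return fn, failed_so_far
    return None, failed_so_far


rounds = [
    (v1, [([1, 2, 3], 5)]),
    (v2, [([], 2)]),
    (v3, [([1, 2], 1)]),
]
winner, regression_suite = refine(rounds, rotation_ok)
print(winner.__name__, regression_suite)
```

Because the accumulated suite only grows, a revision is accepted only if it passes every case that any earlier revision failed, which is the non-regression property described above.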

As LLMs are increasingly augmented with tool use and programmatic APIs, concrete execution will become a default primitive for robust software agents. SolidCoder’s results suggest the next bottleneck is not primitive code generation but rather rigorous, interpretable, and compositional self-verification; the design pattern extends naturally to repository-level, multi-language, and real-world development workflows.

Limitations and Future Directions

While SolidCoder’s evaluation is limited to function-level, Python-centric problems, the approach is extensible in principle. Language sandboxing, broader property oracle synthesis, and efficiency-aware routing are the next frontiers—especially as token and call overhead remain non-negligible on easy or already-solved tasks.

Further, in heterogeneous setups where different LLMs handle code generation and test generation, it remains to be studied how systematic biases in oracle generation propagate across agents.

Conclusion

SolidCoder offers a methodical approach to closing the Mental-Reality Gap in LLM code generation: it surfaces edge cases before planning and grounds final validation in concrete execution rather than imagined traces. The reported results show consistent improvements over strong simulation-based pipelines and indicate that the specification and verification axes must both be addressed for reliable, generalizable autonomous programming. Future research should extend these principles to broader codebases, richer oracles, and more efficient conditional inference.
