Citation Discipline in Spec-Driven Development: A Cross-Model Empirical Study of Output Determinism and Automated Hallucination Detection in LLM-Generated Code

Published 28 Jun 2026 in cs.SE and cs.AI | (2606.30689v1)

Abstract: Spec-Driven Development (SDD) frameworks guide LLM-powered code generation through formal specifications, yet they differ fundamentally in how they enforce traceability between requirements and generated code. This paper presents two controlled empirical studies comparing three SDD frameworks: traceSDD, which enforces mandatory per-line requirement citations using hierarchical REQ-XXX.Y.Z identifiers; Spec Kit, which uses artifact-level traceability through user stories and acceptance criteria; and OpenSpec, which relies on post-hoc external trace maps. We measure two primary outcomes across two frontier LLMs, Claude Sonnet 4.6 (N=20, 4 conditions, 240 implementations) and GLM-5-turbo (N=50, 4 conditions, 600 implementations): output determinism (lexical similarity across independent LLM sessions) and automated hallucination detection rate (TDR). Our pre-registered analysis reveals a consistent, cross-model replicated trade-off: the uncited condition produces significantly higher determinism than the cited condition (Claude: $d=-0.76$, $p=0.003$; GLM: $d=-0.72$, $p<0.001$), while only the cited condition enables automated hallucination detection (TDR: Claude 86.4%, GLM 88.0%, vs 0% for all alternatives, FPR=0% across both studies). traceSDD (cited) significantly outperforms Spec Kit on determinism (Claude: $d=0.47$, $p=0.049$; GLM: $d=0.42$, $p=0.003$) but not OpenSpec (Claude: $d=0.18$, $p=0.44$; GLM: $d=0.14$, $p=0.32$). These findings establish that citation annotations trade determinism for verifiability, and that this trade-off generalizes across model architectures.

Authors (1)

Subham Panda

Summary

The paper demonstrates that mandatory per-line citations significantly reduce output determinism (e.g., mean LSS drops from 0.745 to 0.535 for Claude Sonnet 4.6) while enabling automated hallucination detection.
It employs controlled experiments across two LLMs and four conditions spanning three SDD frameworks, using paired t-tests and effect size metrics for statistical validation.
The study advises high-integrity systems to prioritize verifiability with traceSDD citations despite reduced determinism, highlighting the benefits of structured specification anchoring.

Citation Discipline in Spec-Driven LLM Code Generation: An Analytical Review

Overview

This paper empirically evaluates how mandatory per-line requirement citations affect both the determinism and the hallucination detectability of LLM-generated code in specification-driven development (SDD) frameworks. Using controlled experiments on two frontier LLMs (Claude Sonnet 4.6 and GLM-5-turbo), the study benchmarks three SDD frameworks (traceSDD with and without citations, Spec Kit, and OpenSpec) on output determinism, hallucination detection, and functional correctness. The central finding is a trade-off that replicates across both models: forcing citations reduces determinism, but it is the only condition that enables automated detection of specification deviations ("hallucinations").

Experimental Design and Methodology

The analysis leverages a factorial experimental design:

Tasks: 20 diverse Python programming tasks for Claude, 50 for GLM, spanning 8 engineering domains and three difficulty levels.
Conditions: Four conditions across three frameworks (traceSDD cited, traceSDD uncited, Spec Kit, OpenSpec), each tested with three cold-start LLM runs per task, giving 240 implementations for Claude and 600 for GLM.
Metrics:
- Lexical Similarity Score (LSS): Diff-based similarity across independent LLM outputs for determinism quantification.
- Traceability Detection Rate (TDR): The fraction of injected hallucinations (specification deviations) detected automatically via orphan requirement reference (orphan-REQ) analysis.
- Functional Correctness (TPR): Test-pass rate, used to check that specification discipline does not impair correctness.
- False Positive Rate (FPR): Ensures that hallucination detection mechanisms do not spuriously flag correct code.
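
As a rough illustration of the LSS metric, the sketch below scores diff-based similarity between independent runs of one task. It assumes Python's standard `difflib.SequenceMatcher` as the diff engine and line-level comparison; the paper's exact similarity function is not given in this summary, so both choices are assumptions. Following the Figure 5 caption, run 1 is compared against each later run. The function names `lexical_similarity` and `lss` are hypothetical.

```python
from difflib import SequenceMatcher


def lexical_similarity(a: str, b: str) -> float:
    """Diff-based similarity in [0, 1] between two sources, compared line by line."""
    matcher = SequenceMatcher(None, a.splitlines(), b.splitlines(), autojunk=False)
    return matcher.ratio()


def lss(runs: list[str]) -> float:
    """Mean similarity of run 1 against every other run (assumed pairing scheme)."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure determinism")
    first, others = runs[0], runs[1:]
    return sum(lexical_similarity(first, other) for other in others) / len(others)


# Illustrative inputs only, not data from the study.
runs = [
    "def f(x):\n    return x + 1\n",
    "def f(x):\n    return x + 1\n",
    "def f(x):\n    # REQ-001.1.1\n    return x + 1\n",
]
score = lss(runs)
```

In this toy input the third run differs only by an added citation comment, which is enough to pull the score below 1.0 even though the code logic is identical.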

Statistical analysis includes paired t-tests, Wilcoxon signed-rank tests, and effect size reporting (Cohen's $d$, Cliff's delta), supported by multi-level corrections and per-task breakdowns.
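
The effect sizes named above can be sketched with the standard library alone. This summary does not say which variant of Cohen's $d$ the paper uses; the sketch assumes the paired form (mean of the per-task differences divided by their standard deviation), and the input values are made up for illustration, not taken from the study.

```python
from math import sqrt
from statistics import mean, stdev


def paired_cohens_d(x: list[float], y: list[float]) -> float:
    """Paired effect size: mean difference over the SD of the differences (assumed variant)."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / stdev(diffs)


def paired_t_statistic(x: list[float], y: list[float]) -> float:
    """t statistic of a paired t-test; degrees of freedom are len(x) - 1."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def cliffs_delta(x: list[float], y: list[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross pairs."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))


# Hypothetical per-task LSS values for two conditions (illustration only).
cited = [0.50, 0.55, 0.60, 0.45]
uncited = [0.70, 0.75, 0.70, 0.65]
d = paired_cohens_d(cited, uncited)
t = paired_t_statistic(cited, uncited)
delta = cliffs_delta(cited, uncited)
```

A negative $d$ here means the first condition (cited) scores lower than the second (uncited), matching the sign convention of the values reported in the abstract.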

Determinism and the Effect of Citations

The core empirical result is that mandatory per-line citation annotations produce a significant and consistent reduction in output determinism:

Quantitative Results:
- In both Claude and GLM, the uncited traceSDD condition achieves the highest output determinism (mean LSS: 0.745 and 0.644, respectively), with the cited traceSDD condition scoring lower (0.535 and 0.510).
- The determinism penalty from citations is medium-to-large in both models: Cohen's $d = -0.76$ ($p = 0.003$) for Claude and $d = -0.72$ ($p < 0.001$) for GLM.
Interpretation: Citation placement choices (e.g., granularity, repetition) introduce variability orthogonal to code logic, which increases string-level divergence across independent LLM generations and so lowers the diff-based LSS.
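Framework Comparison: Both traceSDD conditions score higher mean LSS than the Spec Kit and OpenSpec baselines, suggesting that the REQ-format specification structure itself anchors output. For the cited condition, the advantage is significant against Spec Kit (Claude: $d=0.47$, $p=0.049$; GLM: $d=0.42$, $p=0.003$) but not against OpenSpec (Claude: $d=0.18$, $p=0.44$; GLM: $d=0.14$, $p=0.32$).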
Figure 1: Mean LSS by condition across both LLMs. Error bars represent one standard deviation. The uncited condition consistently achieves the highest determinism, while both traceSDD variants outperform Spec Kit and OpenSpec baselines.

The per-task analysis further reveals that the determinism penalty of citations is most acute in easy and small-scale tasks, decaying for complex, larger codebases—likely due to baseline output variability overwhelming citation-related noise.

Figure 2: Citation determinism penalty by difficulty level and size class (GLM-5-turbo, N=50). The penalty is largest for easy tasks ($d = -1.85$) and small tasks ($d = -0.95$), but not significant for hard tasks ($d = -0.34$, ns) or large tasks ($d = -0.21$, ns).

Automated Detection of Hallucinations

A core finding is that only the cited traceSDD condition enables automated, high-precision detection of specification deviations using the orphan-REQ check:

Detection Efficacy: TDR reaches 86.4% (Claude) and 88.0% (GLM) in the cited condition, with 0% false positive rate. All other frameworks—including uncited traceSDD—yield 0% TDR.
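Orphan-REQ Mechanism: A simple set difference between the requirement IDs cited in code and those defined in the specification flags unauthorized features, prior-art bleed, or out-of-spec engineering. With hashed sets, each ID lookup is constant time on average, so the whole check is linear in the number of citations.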
Uncited Control: Absence of inline citations eliminates the ability to perform automated hallucination detection, even with structured specs.
Figure 3: Traceability Detection Rate by condition across both LLMs. Only the traceSDD (cited) condition achieves non-zero automated detection. The uncited condition is particularly instructive: it uses the identical REQ-format specification but achieves 0% TDR because no citations appear in the generated code.
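
The orphan-REQ check described above can be sketched as a set difference. The citation syntax is an assumption: this summary gives only the hierarchical `REQ-XXX.Y.Z` identifier shape, so the regular expression and the convention that citations live in `#` comments are hypothetical, as are the function names.

```python
import re

# Assumed identifier shape: "REQ-" + alphanumeric root + optional dotted numeric levels.
REQ_ID = re.compile(r"REQ-[A-Za-z0-9]+(?:\.\d+)*")


def cited_ids(code: str) -> set[str]:
    """Collect requirement IDs found in '#' comments (naive: ignores '#' inside strings)."""
    ids: set[str] = set()
    for line in code.splitlines():
        _, sep, comment = line.partition("#")
        if sep:
            ids.update(REQ_ID.findall(comment))
    return ids


def orphan_reqs(code: str, spec_ids: set[str]) -> set[str]:
    """IDs cited in code that the specification does not define."""
    return cited_ids(code) - spec_ids


def detection_rate(detected: list[bool]) -> float:
    """Fraction of injected deviations that the check flagged."""
    return sum(detected) / len(detected)


# Illustrative spec and code; the second line cites an ID the spec never defines.
spec = {"REQ-001.1.1", "REQ-001.1.2"}
code = (
    "def add(a, b):  # REQ-001.1.1\n"
    "    log(a, b)  # REQ-001.9.9\n"
    "    return a + b  # REQ-001.1.2\n"
)
orphans = orphan_reqs(code, spec)
```

Code with no citations yields an empty `cited_ids` set, so the check can never fire, which is consistent with the 0% TDR reported for the uncited condition.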

Despite the high average TDR, detection is not perfect—a minority of hallucinations are realized as uncited lines, escaping the orphan-REQ check. The practical implication is that while citation discipline delivers a unique, automated avenue for hallucination detection, it is bounded by its coverage of cited code lines, and further enhancements (e.g., coverage for uncited lines) may be necessary for maximal assurance.
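
One way the coverage gap for uncited lines might be narrowed is a complementary check that lists code lines carrying no citation at all. This is a hypothetical sketch, not a mechanism the paper describes; the identifier pattern and the rule that blank and comment-only lines are exempt are assumptions.

```python
import re

# Assumed identifier shape, as in the hierarchical REQ-XXX.Y.Z scheme.
REQ_ID = re.compile(r"REQ-[A-Za-z0-9]+(?:\.\d+)*")


def uncited_lines(code: str) -> list[int]:
    """Return 1-based numbers of code lines that carry no requirement citation."""
    missing = []
    for number, line in enumerate(code.splitlines(), start=1):
        stripped = line.strip()
        # Blank lines and comment-only lines are not treated as code here.
        if not stripped or stripped.startswith("#"):
            continue
        if not REQ_ID.search(line):
            missing.append(number)
    return missing


sample = (
    "def add(a, b):  # REQ-001.1.1\n"
    "    print(a)\n"
    "\n"
    "    return a + b  # REQ-001.1.2\n"
)
flagged = uncited_lines(sample)
```

Lines flagged this way would still need manual review, since a missing citation alone does not show that the line deviates from the specification.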

Framework Ranking and Implications

The synthesis of determinism and detection data provides actionable recommendations:

For High-Integrity Systems: When verifiability, traceability, and automated hallucination detection are paramount (regulated industries, safety-critical systems), traceSDD with citations is the only condition tested that provides automated detection, at the cost of reduced determinism.
For Rapid Prototyping: Uncited traceSDD yields optimal determinism while retaining the benefits of structured specification anchoring, suitable where automated deviation detection is less critical.
Spec Kit and OpenSpec: Under the conditions studied, neither framework achieved higher mean determinism than either traceSDD condition, and neither supports automated detection (0% TDR).

The per-condition and effect size analyses (see forest plot) reinforce the primary conclusions regarding framework capabilities and trade-offs.

Figure 4: Forest plot of Cohen's d with 95% confidence intervals for all cross-condition comparisons. Circles indicate significant comparisons ($p < 0.05$); squares indicate non-significant comparisons. The citation isolation effect (top two rows) is remarkably consistent across both models.

Maintenance, Scalability, and Theoretical Significance

Beyond immediate functional implications, citation discipline introduces structural affordances for long-term maintenance. Inline citations enable live, filterable traceability graphs; e.g., querying all code sections implicated by a requirement is trivial in traceSDD, scaling well with codebase size and requirement volume.
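
A minimal sketch of such a traceability query, assuming citations appear inline as `REQ-XXX.Y.Z` tokens and that sources are available as an in-memory mapping from path to text. The names `build_trace_index` and `lines_for` are hypothetical, not part of traceSDD.

```python
import re
from collections import defaultdict

# Assumed identifier shape for inline citations.
REQ_ID = re.compile(r"REQ-[A-Za-z0-9]+(?:\.\d+)*")


def build_trace_index(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each cited requirement ID to the (path, line number) locations citing it."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path, source in files.items():
        for number, line in enumerate(source.splitlines(), start=1):
            for req in REQ_ID.findall(line):
                index[req].append((path, number))
    return dict(index)


def lines_for(index: dict[str, list[tuple[str, int]]], req: str) -> list[tuple[str, int]]:
    """Locations citing `req` or any requirement nested beneath it in the hierarchy."""
    hits = []
    for cited, locations in index.items():
        if cited == req or cited.startswith(req + "."):
            hits.extend(locations)
    return sorted(hits)


# Illustrative two-file codebase.
files = {
    "a.py": "x = 1  # REQ-001.1.1\ny = 2  # REQ-001.2.1\n",
    "b.py": "z = 3  # REQ-001.1.2\n",
}
index = build_trace_index(files)
affected = lines_for(index, "REQ-001.1")
```

Building the index is a single pass over the source, and the hierarchical prefix match lets a query on a parent requirement return every line implementing its children.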

From a theoretical perspective, the output variability induced by citation discipline exposes a tension between verifiability (via explicit, auditable claims) and output stability in stochastic code generation. This insight may generalize to other areas of prompt engineering and program synthesis, where annotation overhead could likewise trade off against lexical consistency regardless of the LLM backend.

Future Directions

The study highlights several avenues for further research:

Extension to additional LLMs (GPT-4, Gemini, Llama, Mistral) to test universality.
Replication on production-scale, multi-module projects.
Exploration of hybrid approaches combining determinism-centric and verification-centric paradigms.
Longitudinal studies of maintenance effort and defect rate modulation by citation traceability.
Integration of decoding parameters (e.g., temperature) and prompt engineering with citation behavior for deeper model insight.

Conclusion

This paper delivers a multi-model empirical analysis establishing that per-line citation discipline in SDD frameworks yields a predictable, quantifiable reduction in code generation determinism, together with automated hallucination detection that no other tested condition provides. Structured REQ-format specifications, when uncited, offer the best determinism anchor among all tested approaches, but citation annotations were the only mechanism studied that automates hallucination flagging without manual review. The determinism-detection trade-off documented here replicated across both models tested and provides principled guidance for both theoretical and applied AI-assisted software engineering research.

Figure 5: Per-task LSS distribution across four conditions (Claude Sonnet 4.6, N=20 tasks). Each point represents one pairwise comparison (run_1 vs run_2 or run_1 vs run_3). The uncited condition shows higher and more consistent LSS than the cited condition.