- The paper demonstrates that mandatory per-line citations significantly reduce output determinism (e.g., LSS drops from 0.745 to 0.535) while enabling automated hallucination detection.
- It employs controlled experiments across two LLMs and four SDD conditions (three frameworks, with traceSDD tested both with and without citations), using paired t-tests and effect size metrics for statistical validation.
- The study advises high-integrity systems to prioritize verifiability with traceSDD citations despite reduced determinism, highlighting the benefits of structured specification anchoring.
Citation Discipline in Spec-Driven LLM Code Generation: An Analytical Review
Overview
This paper empirically evaluates how citation mechanisms, specifically mandatory per-line requirement citations, affect both the determinism and the hallucination detectability of LLM-generated code in specification-driven development (SDD) frameworks. Through controlled experiments on two frontier LLMs (Claude Sonnet 4.6 and GLM-5-turbo), the study benchmarks three SDD paradigms (traceSDD with and without citations, Spec Kit, and OpenSpec) for output determinism, hallucination detection, and functional correctness. The central finding is a trade-off: forcing citations reduces determinism, but it is the only tested condition that enables automated detection of specification deviations ("hallucinations"), and the effect holds for both models tested.
Experimental Design and Methodology
The analysis leverages a factorial experimental design:
- Tasks: 20 diverse Python programming tasks for Claude, 50 for GLM, spanning 8 engineering domains and three difficulty levels.
- Conditions: Four conditions drawn from three frameworks (traceSDD cited, traceSDD uncited, Spec Kit, OpenSpec), each tested with three cold-start LLM runs per task.
- Metrics:
  - Lexical Similarity Score (LSS): Diff-based similarity across independent LLM outputs for determinism quantification.
  - Traceability Detection Rate (TDR): The fraction of injected hallucinations (specification deviations) detected automatically via orphan requirement reference (orphan-REQ) analysis.
  - Functional Correctness (TPR): Test-pass rate to assert that specification discipline does not impair correctness.
  - False Positive Rate (FPR): Ensures that hallucination detection mechanisms do not spuriously flag correct code.
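To make the LSS metric concrete, here is a minimal sketch of a diff-based similarity score. The review does not say which diff algorithm the paper uses, so this example assumes Python's `difflib.SequenceMatcher` over lines and averages the similarity of the first run against each later run (mirroring the run_1 vs run_2 and run_1 vs run_3 pairing mentioned in the Figure 5 caption). The function names are hypothetical.

```python
from difflib import SequenceMatcher
from statistics import mean


def pairwise_similarity(a: str, b: str) -> float:
    """Diff-based similarity in [0, 1] between two generated programs."""
    # Assumption: compare line by line; the paper's exact diff metric is not given.
    return SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()


def lexical_similarity_score(runs: list[str]) -> float:
    """Mean similarity of the first run against every other independent run."""
    if len(runs) < 2:
        raise ValueError("need at least two runs to measure determinism")
    reference, *others = runs
    return mean(pairwise_similarity(reference, other) for other in others)
```

Under this definition, three identical outputs score 1.0, and any line that differs between runs lowers the score, which is why added citation comments that vary from run to run can depress LSS even when the executable code is unchanged.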
Statistical analysis includes paired t-tests, Wilcoxon signed-rank tests, and effect size reporting (Cohen's d, Cliff's delta), supported by multi-level corrections and per-task breakdowns.
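The paired statistics named above can be sketched with the standard library alone. This is an illustrative sketch, not the paper's analysis code: it assumes per-task scores for two conditions are paired by task, and it uses the paired-differences form of Cohen's d (mean difference divided by the standard deviation of the differences), which may differ from the variant the authors report. All function names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def _paired_diffs(x: list[float], y: list[float]) -> list[float]:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    return [a - b for a, b in zip(x, y)]


def paired_t_statistic(x: list[float], y: list[float]) -> float:
    """t statistic of a paired t-test (degrees of freedom: len(x) - 1)."""
    diffs = _paired_diffs(x, y)
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def cohens_d_paired(x: list[float], y: list[float]) -> float:
    """Cohen's d for paired samples (assumed variant: mean diff / SD of diffs)."""
    diffs = _paired_diffs(x, y)
    return mean(diffs) / stdev(diffs)


def cliffs_delta(x: list[float], y: list[float]) -> float:
    """Cliff's delta: P(x > y) - P(x < y) over all cross-sample pairs."""
    greater = sum(a > b for a in x for b in y)
    less = sum(a < b for a in x for b in y)
    return (greater - less) / (len(x) * len(y))
```

A negative d or delta with `x` as the cited condition and `y` as the uncited condition would indicate that citations lower the score, matching the sign convention of the effect sizes quoted in the figure captions.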
Determinism and the Effect of Citations
The core empirical result is that mandatory per-line citation annotations produce a significant and consistent reduction in output determinism; for example, LSS drops from 0.745 to 0.535.
The per-task analysis further reveals that the determinism penalty of citations is most acute in easy and small-scale tasks, decaying for complex, larger codebases—likely due to baseline output variability overwhelming citation-related noise.
Figure 2: Citation determinism penalty by difficulty level and size class (GLM-5-turbo, N=50). The penalty is largest for easy tasks (d=−1.85) and small tasks (d=−0.95), but not significant for hard tasks (d=−0.34, ns) or large tasks (d=−0.21, ns).
Automated Detection of Hallucinations
A core finding is that only the cited traceSDD condition enables automated, high-precision detection of specification deviations using the orphan-REQ check.
Despite the high average TDR, detection is not perfect—a minority of hallucinations are realized as uncited lines, escaping the orphan-REQ check. The practical implication is that while citation discipline delivers a unique, automated avenue for hallucination detection, it is bounded by its coverage of cited code lines, and further enhancements (e.g., coverage for uncited lines) may be necessary for maximal assurance.
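The orphan-REQ idea and its coverage limit can be illustrated with a small sketch. It rests on several assumptions not confirmed by the review: citations are written as trailing comments of the form `# REQ-003`, an "orphan" is a cited ID that does not appear in the specification, and TDR is computed over line numbers of injected deviations. The sample code, requirement IDs, and function names are all hypothetical.

```python
import re

# Assumed citation syntax: requirement IDs such as "REQ-003" inside a comment.
REQ_ID = re.compile(r"\bREQ-\d+\b")


def cited_requirements(code: str) -> dict[int, set[str]]:
    """Map 1-based line numbers to the requirement IDs cited in their comments."""
    citations: dict[int, set[str]] = {}
    for lineno, line in enumerate(code.splitlines(), start=1):
        # Simplification: treats the first '#' as the comment start,
        # so a '#' inside a string literal would be misread.
        _, sep, comment = line.partition("#")
        ids = set(REQ_ID.findall(comment)) if sep else set()
        if ids:
            citations[lineno] = ids
    return citations


def orphan_citations(code: str, spec_ids: set[str]) -> dict[int, set[str]]:
    """Return lines that cite a requirement ID absent from the specification."""
    orphans = {}
    for lineno, ids in cited_requirements(code).items():
        unknown = ids - spec_ids
        if unknown:
            orphans[lineno] = unknown
    return orphans


def traceability_detection_rate(injected: set[int], flagged: set[int]) -> float:
    """Fraction of injected deviation lines that the check flagged."""
    if not injected:
        raise ValueError("no injected deviations to detect")
    return len(injected & flagged) / len(injected)


SPEC_IDS = {"REQ-001", "REQ-002"}
SAMPLE_CODE = '''def area(w, h):  # REQ-001
    if w < 0 or h < 0:  # REQ-002
        raise ValueError("negative")  # REQ-002
    audit_log(w, h)  # REQ-009
    print("debug")
    return w * h  # REQ-001
'''
```

In this sample, line 4 cites a requirement that does not exist and is flagged, while the uncited line 5 escapes the check. Treating both as injected deviations gives a detection rate of 0.5, which shows in miniature why detection is bounded by citation coverage.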
Framework Ranking and Implications
The synthesis of determinism and detection data provides actionable recommendations:
- For High-Integrity Systems: When verifiability, traceability, and automated hallucination detection are paramount (regulated industries, safety-critical systems), traceSDD with citations offers an unmatched tool despite reduced determinism.
- For Rapid Prototyping: Uncited traceSDD yields optimal determinism while retaining the benefits of structured specification anchoring, suitable where automated deviation detection is less critical.
- Spec Kit and OpenSpec: These frameworks offer neither superior determinism nor detection capability compared to the two traceSDD conditions under the parameters studied.
The per-condition and effect size analyses (see the forest plot in Figure 4) reinforce the primary conclusions regarding framework capabilities and trade-offs.
Figure 4: Forest plot of Cohen's d with 95% confidence intervals for all cross-condition comparisons. Circles indicate significant comparisons (p<0.05); squares indicate non-significant comparisons. The citation isolation effect (top two rows) is remarkably consistent across both models.
Maintenance, Scalability, and Theoretical Significance
Beyond immediate functional implications, citation discipline introduces structural affordances for long-term maintenance. Inline citations enable live, filterable traceability graphs; e.g., querying all code sections implicated by a requirement is trivial in traceSDD, scaling well with codebase size and requirement volume.
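As a rough illustration of such a traceability query, the sketch below builds an inverted index from requirement IDs to the code locations that cite them. It assumes the same hypothetical `# REQ-nnn` comment syntax and an in-memory mapping of file paths to source text; traceSDD's actual tooling and citation format are not described in this review.

```python
import re
from collections import defaultdict

# Assumed citation syntax: requirement IDs such as "REQ-001" inside a comment.
REQ_ID = re.compile(r"\bREQ-\d+\b")


def build_trace_index(files: dict[str, str]) -> dict[str, list[tuple[str, int]]]:
    """Map each cited requirement ID to the (path, line number) pairs citing it."""
    index: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for path, source in files.items():
        for lineno, line in enumerate(source.splitlines(), start=1):
            _, sep, comment = line.partition("#")
            if not sep:
                continue  # no comment on this line, so no citation
            for req_id in sorted(set(REQ_ID.findall(comment))):
                index[req_id].append((path, lineno))
    return dict(index)
```

Looking up `index["REQ-001"]` then returns every cited location for that requirement in a single dictionary access, and building the index is one linear pass over the source lines.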
From a theoretical perspective, the loss of determinism induced by citation discipline exposes the tension between verifiability (via explicit, auditable claims) and output stability in stochastic code generation. This insight plausibly generalizes to other domains of prompt engineering and program synthesis, where annotation overhead may trade off against lexical consistency regardless of the LLM backend.
Future Directions
The study highlights several avenues for further research:
- Extension to additional LLMs (GPT-4, Gemini, Llama, Mistral) to test universality.
- Replication on production-scale, multi-module projects.
- Exploration of hybrid approaches combining determinism-centric and verification-centric paradigms.
- Longitudinal studies of maintenance effort and defect rate modulation by citation traceability.
- Integration of decoding parameters (e.g., temperature) and prompt engineering with citation behavior for deeper model insight.
Conclusion
This paper delivers a robust, multi-model empirical analysis establishing that per-line citation discipline in SDD frameworks yields a predictable, quantifiable reduction in code generation determinism, concomitant with uniquely effective automated hallucination detection. Structured, REQ-format specifications—when uncited—offer the best determinism anchor among all tested approaches, but the presence of citation annotations is the only way to fully automate hallucination flagging without manual review. The determinism-detection trade-off elucidated here is fundamental, specification-format invariant, and provides principled guidance for both theoretical and applied AI-assisted software engineering research.
Figure 5: Per-task LSS distribution across four conditions (Claude Sonnet 4.6, N=20 tasks). Each point represents one pairwise comparison (run_1 vs run_2 or run_1 vs run_3). The uncited condition shows higher and more consistent LSS than the cited condition.