Propensity Smelly Score (PSC) in LLM Code
- PSC is a probabilistic metric that aggregates token probabilities over code spans to quantify the likelihood of LLM-generated code smells.
- It employs mean, median, and relative variants to provide interpretable signals and enable thresholding for practical smell mitigation.
- Empirical validation and causal analysis show that PSC enhances prompt-engineering strategies and improves human interpretability in software quality triage.
The Propensity Smelly Score (PSC) is a probabilistic metric designed to quantify the likelihood of specific code smell instances emerging in code generated by LLMs. PSC aggregates next-token probabilities over code spans detected by static analysis tools, providing an interpretable signal of model “confidence” in emitting smelly code snippets. It offers a precise, likelihood-based measure of smell propensity, is empirically validated for robustness and informativeness, and supports causal analysis of LLM generation factors. PSC further enables practical mitigation of smells through prompt-engineering and acts as a heuristic for human interpretability in software quality triage (Velasco et al., 19 Nov 2025).
1. Mathematical Definition of Propensity Smelly Score
Let $x = (x_1, \dots, x_T)$ denote the LLM-generated token sequence, and let a static analyzer (e.g., Pylint) associate a smell with the token span $S \subseteq \{1, \dots, T\}$ via an alignment function $\alpha$. Three PSC variants are formally defined:
- Mean PSC: $\mathrm{PSC}_{\text{mean}}(S) = \frac{1}{|S|} \sum_{i \in S} p(x_i \mid x_{<i})$
- Median PSC: $\mathrm{PSC}_{\text{med}}(S) = \operatorname{median}_{i \in S}\, p(x_i \mid x_{<i})$
- Relative PSC (rescaled to $[0, 1]$ using empirical minima and maxima $\mathrm{PSC}_{\min}$, $\mathrm{PSC}_{\max}$): $\mathrm{PSC}_{\text{rel}}(S) = \dfrac{\mathrm{PSC}(S) - \mathrm{PSC}_{\min}}{\mathrm{PSC}_{\max} - \mathrm{PSC}_{\min} + \epsilon}$
where $\epsilon > 0$ ensures numerical stability. A higher PSC value implies greater model propensity, or confidence, in generating the given smelly code snippet.
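The three variants can be sketched directly from their definitions; this is a minimal illustration over a list of span-aligned token probabilities (the probability values are made up for the example):

```python
import statistics

EPS = 1e-8  # small epsilon for numerical stability in the relative variant

def mean_psc(probs):
    """Mean PSC: average next-token probability over the smelly span."""
    return sum(probs) / len(probs)

def median_psc(probs):
    """Median PSC: robust to a few outlier tokens within the span."""
    return statistics.median(probs)

def relative_psc(psc, psc_min, psc_max, eps=EPS):
    """Relative PSC: min-max rescaling to [0, 1] via empirical extrema."""
    return (psc - psc_min) / (psc_max - psc_min + eps)

# Token probabilities aligned to one detected smell span (illustrative)
span_probs = [0.91, 0.40, 0.77, 0.85]
m = mean_psc(span_probs)      # 0.7325
med = median_psc(span_probs)  # 0.81
rel = relative_psc(med, psc_min=0.1, psc_max=0.95)
```

The median variant downweights isolated low-probability tokens inside an otherwise high-confidence span, which is one reason it is favored in practice.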
2. Estimation and Computation Procedure
PSC computation for an LLM-generated code sample involves the following steps:
- Token-Level Probabilities: During autoregressive decoding, token-level logits $z_t$ are converted to probabilities by $p(x_t \mid x_{<t}) = \operatorname{softmax}(z_t)_{x_t}$.
- Span Alignment: A static analyzer identifies a smell instance with a character range, mapped to token indices $S$ by the alignment function $\alpha$.
- Score Aggregation: Compute mean, median, or relative PSC over the tokens $i \in S$. Median and relative aggregations are most commonly used.
- Thresholding (for Interpretability): A smell may be labeled “likely” if $\mathrm{PSC} \geq \tau$, typically with $\tau = 0.5$. This threshold is applied post hoc, not as part of the score itself.
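The alignment and aggregation steps can be sketched as follows, assuming token offsets are available as (start, end) character ranges, as typical tokenizers expose; the function names are illustrative:

```python
def tokens_in_span(token_offsets, char_start, char_end):
    """Alignment: indices of tokens whose character range overlaps the smell span."""
    return [i for i, (s, e) in enumerate(token_offsets)
            if s < char_end and e > char_start]

def psc_for_smell(token_probs, token_offsets, char_start, char_end, tau=0.5):
    """Median PSC over the aligned span, plus a post-hoc 'likely smelly' flag."""
    idx = tokens_in_span(token_offsets, char_start, char_end)
    span = sorted(token_probs[i] for i in idx)
    n = len(span)
    med = span[n // 2] if n % 2 else (span[n // 2 - 1] + span[n // 2]) / 2
    return med, med >= tau

# Four tokens covering characters 0..15; the smell spans characters 3..12
offsets = [(0, 3), (3, 7), (7, 12), (12, 15)]
probs = [0.9, 0.4, 0.8, 0.7]
score, likely = psc_for_smell(probs, offsets, 3, 12)  # median of [0.4, 0.8]
```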
3. Empirical Validation and Robustness of PSC
Two complementary validations establish PSC’s utility as a structural code quality measure:
- Semantic-Preserving Transformations (SECT): Six statement-level edits (e.g., Add2Equal, SwitchRelation) are applied, and PSC is computed for original and mutated snippets. ANOVA on logit-transformed scores reveals that 31/41 analyzed smells (76%) exhibit statistical stability (no significant score shift) under syntax-preserving edits. Smells linked to naming/formatting (e.g., invalid-name) naturally show higher sensitivity.
- Information Gain (IG): Using a binary severity label (defined by token-level flag proportion), PSC demonstrates higher information gain than BLEU or CodeBLEU scores across all smell categories. This suggests PSC provides a more informative reduction of uncertainty regarding severe smells in generated code.
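The information-gain comparison can be illustrated with a toy computation over binarized PSC values; the data below are illustrative, not the study's:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for a discretized feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [y for x, y in zip(feature, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Binarize PSC at 0.5 and compare against a binary severity label
psc_bins = [int(p >= 0.5) for p in [0.8, 0.7, 0.3, 0.2, 0.9, 0.1]]
severe   = [1, 1, 0, 0, 1, 0]
ig = information_gain(psc_bins, severe)  # 1.0: the bins fully determine severity
```

A higher IG means that knowing the (discretized) score removes more uncertainty about severity, which is the sense in which PSC outperforms BLEU/CodeBLEU here.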
4. PSC as a Causal Instrument for Generation Analysis
PSC serves as the outcome variable in a structural causal model (SCM) designed to relate coding-pipeline variables to smell propensity, accounting for confounders $Z$ (LOC, AST size, token count, part-of-speech frequencies, etc.). In Pearl’s notation, the treatments $T$ are:
- $T_{\text{dec}}$ (decoding strategy: greedy, beam, top-k, top-p, contrastive, sampling)
- $T_{\text{size}}$ (model size: 0.5B, 1.5B, 3B, 7B)
- $T_{\text{arch}}$ (architecture: CodeLlama, Mistral, Qwen, StarCoder)
- $T_{\text{prompt}}$ (prompt formulation: minimal, generic “Complete code,” role-preamble, explicit smell-avoidance instructions)
The ATE (Average Treatment Effect) of an intervention $do(T = t)$ is:
$\mathrm{ATE}(t) = \mathbb{E}[\mathrm{PSC} \mid do(T = t)] - \mathbb{E}[\mathrm{PSC} \mid do(T = t_0)]$
where $t_0$ is the default setting. Estimates are obtained via stratification/regression over the confounders $Z$, utilizing the doWhy library. Random-confounder, placebo-treatment, simulated unmeasured-confounding, and subset-stability refutation tests confirm robustness.
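The stratification estimate can be sketched in plain Python (doWhy wraps this plus the refutation tests; the row layout `(treatment, stratum, psc)` is an assumption for illustration):

```python
from collections import defaultdict
from statistics import mean

def stratified_ate(rows, treatment, baseline):
    """ATE(t) = E[PSC | do(T=t)] - E[PSC | do(T=t0)], estimated as the
    size-weighted average of within-stratum mean differences.
    Each row is (treatment_value, confounder_stratum, psc)."""
    strata = defaultdict(lambda: {"t": [], "c": []})
    for t, s, y in rows:
        if t == treatment:
            strata[s]["t"].append(y)
        elif t == baseline:
            strata[s]["c"].append(y)
    total, ate = 0, 0.0
    for g in strata.values():
        if g["t"] and g["c"]:  # only strata with both arms contribute
            w = len(g["t"]) + len(g["c"])
            ate += w * (mean(g["t"]) - mean(g["c"]))
            total += w
    return ate / total

# Toy data: the treated prompt lowers PSC by 0.2 in every confounder stratum
rows = [("avoid", "small", 0.3), ("base", "small", 0.5),
        ("avoid", "large", 0.5), ("base", "large", 0.7)]
effect = stratified_ate(rows, "avoid", "base")  # -0.2
```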
5. Experimental Findings: Factors Influencing PSC
A tabular summary (Table 3 in the cited study) condenses factor impacts. Positive ATE denotes increased PSC (worse structural quality), negative ATE denotes decreased PSC (improved quality):
| Factor | Effect on PSC | Example Implications |
|---|---|---|
| Generation Strategy | Sampling (top-k, top-p, contrastive) reduces PSC for semantic/refactor smells but increases PSC for surface-level formatting smells | More randomness improves semantic quality but can worsen formatting |
| Model Size | Scaling 0.5B → 7B yields negligible ATE (\|ATE\| ≈ 0) across smell categories | Scaling alone does not mitigate smells |
| Architecture | Switching CodeLlama to Mistral/Qwen/StarCoder (same size) significantly lowers PSC (ATE up to −0.35) for warning/refactor smells | Model design and pretraining strongly influence smell rates |
| Prompt Formulation | Using an explicit smell-avoidance prompt yields large ATE reductions (≈ −0.25 to −0.40) for serious smells | Prompt-engineering is the most practical and effective means for mitigation |
This suggests prompt formulation is the strongest practical lever for reducing smell propensity without retraining, followed by architectural choices.
6. Mitigation Protocols Leveraging PSC
Supported by causal analysis, a prompt-engineering protocol utilizes PSC to selectively mitigate smell instances during inference. An outline:
- Start with base prompt and generate output.
- Use a static analyzer to detect smell spans.
- Compute PSC for each span. If PSC exceeds a threshold $\tau$ (e.g., $\tau = 0.5$), switch to a mitigation prompt:

```
You are an expert software engineer. Please complete the following code
without introducing any of these smells: {list all smell types}.
<smelly_function>
```

- Regenerate code as needed.
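The protocol above can be sketched as a regeneration loop; `generate`, `detect_smells`, and `psc_of` are caller-supplied hooks standing in for the LLM, the static analyzer, and PSC computation (hypothetical names, not the paper's implementation):

```python
def mitigate(prompt, generate, detect_smells, psc_of,
             smell_types, tau=0.5, max_rounds=3):
    """Regenerate until no detected smell span has PSC >= tau, or give up."""
    code = generate(prompt)
    for _ in range(max_rounds):
        risky = [s for s in detect_smells(code) if psc_of(code, s) >= tau]
        if not risky:
            return code
        # Switch to the smell-avoidance mitigation prompt and retry
        mitigation_prompt = (
            "You are an expert software engineer. Please complete the following "
            "code without introducing any of these smells: "
            + ", ".join(smell_types) + ".\n" + prompt
        )
        code = generate(mitigation_prompt)
    return code
```

Bounding the loop with `max_rounds` keeps the protocol cheap at inference time while still handling the high-propensity spans that threshold-based triage flags.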
Empirically, appending the “do not introduce these smells” instruction reduced median PSC as follows: broad-exception-raised W0719 from 0.80→0.67, missing-final-newline C0304 from 0.76→0.72, unused-import W0611 from 0.52→0.23. A plausible implication is that such prompt-level interventions are highly actionable for practitioners, especially for high-propensity smells.
7. Human Interpretability and User Study Insights
A between-subjects survey (n=36) compared the impact of PSC versus standard textual smell labels. Key findings:
- For unused-argument W0613, PSC increased perceived “importance to fix” and rater confidence.
- For unused-variable W0612, PSC elevated the sense of “systematic model error.”
- For redefined-builtin W0622, PSC heightened both “systematic” attribution and “importance to fix.”
Open responses revealed developers feel urgency and attention when PSC is high, and reassurance when PSC is low, with PSC acting as a heuristic for ambiguous or subtle smells otherwise easily overlooked. This suggests that PSC augments traditional smell labels, improving human judgment and triage efficacy.
Taken collectively, the Propensity Smelly Score is a fundamentally precise, likelihood-based measure for analyzing, explaining, and mitigating code smell risk in LLM-generated software. Its mathematical tractability, empirical robustness, utility in causal diagnostics, amenability to simple mitigation, and support for human interpretability highlight its significance in quality-aware assessment and deployment of generative models for code (Velasco et al., 19 Nov 2025).