
Propensity Smelly Score (PSC) in LLM Code

Updated 19 January 2026
  • PSC is a probabilistic metric that aggregates token probabilities over code spans to quantify the likelihood of LLM-generated code smells.
  • It employs mean, median, and relative variants to provide interpretable signals and enable thresholding for practical smell mitigation.
  • Empirical validation and causal analysis show that PSC enhances prompt-engineering strategies and improves human interpretability in software quality triage.

The Propensity Smelly Score (PSC) is a probabilistic metric designed to quantify the likelihood of specific code smell instances emerging in code generated by LLMs. PSC aggregates next-token probabilities over code spans detected by static analysis tools, providing an interpretable signal of model “confidence” in emitting smelly code snippets. It offers a precise, likelihood-based measure of smell propensity, is empirically validated for robustness and informativeness, and supports causal analysis of LLM generation factors. PSC further enables practical mitigation of smells through prompt-engineering and acts as a heuristic for human interpretability in software quality triage (Velasco et al., 19 Nov 2025).

1. Mathematical Definition of Propensity Smelly Score

Let $w = (w_1, w_2, \dots, w_n)$ denote the LLM-generated token sequence, and let a static analyzer (e.g., Pylint) associate a smell $\mu$ with the token span $(i, j)$ via an alignment function $\delta_\mu(w)$. Three PSC variants are formally defined:

  • Mean PSC:

$$\theta_\mu^{\text{mean}}(w, i, j) = \frac{1}{j - i + 1} \sum_{k=i}^{j} P(w_k \mid w_1, \dots, w_{k-1})$$

  • Median PSC:

$$\theta_\mu^{\text{median}}(w, i, j) = \operatorname{median}\bigl(\{\, P(w_k \mid w_1, \dots, w_{k-1}) : k = i, \dots, j \,\}\bigr)$$

  • Relative PSC (rescaled to $[0, 1]$ using empirical minima $P_{\min}(w_k)$ and maxima $P_{\max}(w_k)$):

$$\theta_\mu^{\text{rel}}(w, i, j) = \frac{1}{j - i + 1} \sum_{k=i}^{j} \frac{P(w_k) - P_{\min}(w_k)}{P_{\max}(w_k) - P_{\min}(w_k) + \epsilon}$$

where $\epsilon > 0$ ensures numerical stability. A higher PSC value implies greater model propensity, or confidence, in generating the given smelly code snippet.
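As a concrete illustration of the three variants, the following sketch computes them over a hypothetical span of per-token probabilities. The probability values and the per-token min/max calibration arrays are invented for illustration; they are not from the cited study.

```python
import statistics

def psc_mean(probs):
    """Mean PSC: average token probability over the smell span."""
    return sum(probs) / len(probs)

def psc_median(probs):
    """Median PSC: robust to a few low-probability tokens in the span."""
    return statistics.median(probs)

def psc_relative(probs, p_min, p_max, eps=1e-8):
    """Relative PSC: rescale each token by its empirical min/max, then average."""
    return sum((p - lo) / (hi - lo + eps)
               for p, lo, hi in zip(probs, p_min, p_max)) / len(probs)

# Hypothetical probabilities for a 4-token span flagged as a smell.
span = [0.9, 0.8, 0.1, 0.95]
print(psc_mean(span))    # ~0.6875
print(psc_median(span))  # ~0.85 (average of the two middle values)
print(psc_relative(span, p_min=[0.05] * 4, p_max=[0.99] * 4))
```

Note how the mean is dragged down by the single low-probability token while the median stays high, which is one reason the median and relative aggregations are preferred in practice.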

2. Estimation and Computation Procedure

PSC computation for an LLM-generated code sample involves the following steps:

  • Token-Level Probabilities: During autoregressive decoding, token-level logits $\ell_k$ are converted to probabilities by:

$$P(w_k \mid w_1, \dots, w_{k-1}) = \frac{\exp(\ell_k[w_k])}{\sum_v \exp(\ell_k[v])}$$

  • Span Alignment: A static analyzer identifies a smell instance $\mu$ with a character range, mapped to token indices $(i, j)$ by the alignment function $\delta_\mu(w)$.
  • Score Aggregation: Compute the mean, median, or relative PSC over tokens $[i, j]$. The median and relative aggregations are most commonly used.
  • Thresholding (for Interpretability): A smell $\mu$ may be labeled “likely” if $\theta_\mu \geq \lambda$, typically with $\lambda = 0.5$. This threshold is applied post hoc, not as part of the score itself.
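The first and last steps above can be sketched in a few lines: a softmax conversion from logits to a token probability, and the post-hoc thresholding rule. The toy four-token vocabulary and logit values are illustrative only.

```python
import math

def token_probability(logits, token_id):
    """Convert a vocabulary logit vector to P(w_k | context) for one token:
    a softmax with the standard max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    return exps[token_id] / sum(exps)

def label_smell(psc_value, lam=0.5):
    """Post-hoc thresholding: flag the smell as 'likely' when PSC >= lambda."""
    return "likely" if psc_value >= lam else "unlikely"

# Toy logits over a 4-token vocabulary (illustrative values only).
logits = [2.0, 0.5, -1.0, 0.1]
p = token_probability(logits, 0)
print(round(p, 3))        # ~0.703
print(label_smell(0.62))  # likely
```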

3. Empirical Validation and Robustness of PSC

Two complementary validations establish PSC’s utility as a structural code quality measure:

  • Semantic-Preserving Transformations (SECT): Six statement-level edits (e.g., Add2Equal, SwitchRelation) are applied, and $\theta_\mu^{\text{rel}}$ is computed for the original and mutated snippets. ANOVA on logit-transformed scores reveals that 31 of 41 analyzed smells (76%) exhibit statistical stability ($p \approx 1.0$, $\eta^2 \approx 0$) under syntax-preserving edits. Smells linked to naming/formatting (e.g., invalid-name) naturally show higher sensitivity.
  • Information Gain (IG): Using a binary severity label $S \in \{\text{low}, \text{high}\}$ (defined by the token-level flag proportion), PSC demonstrates higher information gain $IG(S; X) = H(S) - H(S \mid X)$ than BLEU or CodeBLEU scores across all smell categories. This suggests PSC provides a more informative reduction of uncertainty regarding severe smells in generated code.
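The information-gain comparison can be sketched as follows, with the score binarized at a threshold so that $H(S \mid X)$ becomes a weighted average of stratum entropies. The severity labels, PSC scores, and threshold below are invented for illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, scores, threshold=0.5):
    """IG(S; X) = H(S) - H(S | X), with X binarized at a threshold."""
    h_s = entropy(labels)
    above = [s for s, x in zip(labels, scores) if x >= threshold]
    below = [s for s, x in zip(labels, scores) if x < threshold]
    h_cond = sum(len(g) / len(labels) * entropy(g) for g in (above, below) if g)
    return h_s - h_cond

# Hypothetical severity labels and PSC scores for six generated snippets.
labels = ["high", "high", "low", "low", "high", "low"]
scores = [0.9, 0.8, 0.2, 0.3, 0.7, 0.4]
print(information_gain(labels, scores))  # perfectly separated at 0.5 -> 1.0
```

A score that cleanly separates severe from non-severe snippets yields IG equal to the full label entropy; an uninformative score yields IG near zero.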

4. PSC as a Causal Instrument for Generation Analysis

PSC serves as the outcome variable $Y$ in a structural causal model (SCM) designed to relate coding-pipeline variables $T_1, \dots, T_4$ to smell propensity, accounting for confounders $Z$ (LOC, AST size, token count, part-of-speech frequencies, etc.). In Pearl’s notation:

  • $T_1$ (decoding strategy: greedy, beam, top-k, top-p, contrastive, sampling)
  • $T_2$ (model size: 0.5B, 1.5B, 3B, 7B)
  • $T_3$ (architecture: CodeLlama, Mistral, Qwen, StarCoder)
  • $T_4$ (prompt formulation: minimal, generic “Complete code,” role preamble, explicit smell-avoidance instructions)

The ATE (Average Treatment Effect) for an intervention $T = t$ is:

$$\text{ATE}_t = E[Y \mid do(T = t)] - E[Y \mid do(T = t_0)]$$

where $t_0$ is the default setting. Estimates are obtained via stratification/regression over $Z$, using the doWhy library. Random-confounder, placebo-treatment, simulated unmeasured-confounding, and subset-stability refutation tests confirm robustness.
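The study’s estimates come from doWhy, but the underlying backdoor-adjustment idea can be sketched by hand as stratification over a single discrete confounder. The treatment names, strata, and outcome values below are invented for illustration.

```python
def ate_stratified(records, treatment, baseline):
    """Average treatment effect by stratifying on a discrete confounder z.

    Each record is (t, z, y): treatment level, confounder stratum, outcome (PSC).
    Within each stratum, compare the mean outcome under `treatment` vs.
    `baseline`, then weight strata by size (backdoor adjustment over z).
    """
    strata = {}
    for t, z, y in records:
        strata.setdefault(z, []).append((t, y))
    total = len(records)
    ate = 0.0
    for rows in strata.values():
        treated = [y for t, y in rows if t == treatment]
        control = [y for t, y in rows if t == baseline]
        if treated and control:
            diff = sum(treated) / len(treated) - sum(control) / len(control)
            ate += (len(rows) / total) * diff
    return ate

# Illustrative data: smell-avoidance prompt vs. minimal prompt,
# confounded by a LOC bucket ("short" vs. "long" functions).
data = [("avoid", "short", 0.25), ("minimal", "short", 0.75),
        ("avoid", "long", 0.5), ("minimal", "long", 1.0)]
print(ate_stratified(data, "avoid", "minimal"))  # -0.5: mitigation lowers PSC
```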

5. Experimental Findings: Factors Influencing PSC

A tabular summary (Table 3 in the cited study) condenses factor impacts. Positive ATE denotes increased PSC (worse structural quality), negative ATE denotes decreased PSC (improved quality):

| Factor | Effect on PSC | Example Implications |
| --- | --- | --- |
| Generation strategy | Sampling (top-k, top-p, contrastive) reduces PSC for semantic/refactor smells but increases PSC for surface-level formatting smells | More randomness improves semantic quality but can worsen formatting |
| Model size | Scaling 0.5B → 7B yields a negligible ATE | Scaling alone does not meaningfully reduce smell propensity |
| Architecture | Switching from CodeLlama to Mistral/Qwen/StarCoder (same size) significantly lowers PSC (ATE up to −0.35) for warning/refactor smells | Model design and pretraining strongly influence smell rates |
| Prompt formulation | An explicit smell-avoidance prompt yields large ATE reductions (≈ −0.25 to −0.40) for serious smells | Prompt engineering is the most practical and effective means of mitigation |

This suggests prompt formulation is the strongest practical lever for reducing smell propensity without retraining, followed by architectural choices.

6. Mitigation Protocols Leveraging PSC

Supported by causal analysis, a prompt-engineering protocol utilizes PSC to selectively mitigate smell instances during inference. An outline:

  1. Start with the base prompt and generate output.
  2. Use a static analyzer to detect smell spans.
  3. Compute PSC for each span. If the PSC exceeds $\lambda$ (e.g., 0.5), switch to a mitigation prompt:

     You are an expert software engineer.
     Please complete the following code without introducing any of these smells: {list all smell types}.
     <smelly_function>

  4. Regenerate the code as needed.
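The loop above can be sketched as follows. The hooks `detect_smells`, `compute_psc`, and `generate` are hypothetical caller-supplied stand-ins for the static analyzer, the PSC scorer, and the LLM call; they are not an API from the cited study.

```python
MITIGATION_TEMPLATE = (
    "You are an expert software engineer.\n"
    "Please complete the following code without introducing any of these "
    "smells: {smells}.\n{code}"
)

def mitigate(code, detect_smells, compute_psc, generate, lam=0.5, max_rounds=3):
    """Regenerate code while any detected smell span has PSC >= lambda.

    detect_smells(code) -> list of smell dicts (each with a "type" key),
    compute_psc(code, span) -> float, generate(prompt) -> new code string.
    """
    for _ in range(max_rounds):
        spans = detect_smells(code)                    # e.g. Pylint findings
        risky = [s for s in spans if compute_psc(code, s) >= lam]
        if not risky:
            return code                                # nothing above threshold
        prompt = MITIGATION_TEMPLATE.format(
            smells=", ".join(sorted({s["type"] for s in risky})), code=code)
        code = generate(prompt)                        # re-prompt the model
    return code
```

The round cap bounds cost when the model keeps reintroducing a smell; spans below $\lambda$ are deliberately left alone, matching the post-hoc use of the threshold.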

Empirically, appending the “do not introduce these smells” instruction reduced median PSC as follows: broad-exception-raised (W0719) from 0.80 to 0.67, missing-final-newline (C0304) from 0.76 to 0.72, and unused-import (W0611) from 0.52 to 0.23. A plausible implication is that such prompt-level interventions are highly actionable for practitioners, especially for high-propensity smells.

7. Human Interpretability and User Study Insights

A between-subjects survey (n=36) compared the impact of PSC versus standard textual smell labels. Key findings:

  • For unused-argument (W0613), PSC increased perceived “importance to fix” and rater confidence.
  • For unused-variable (W0612), PSC elevated the sense of “systematic model error.”
  • For redefined-builtin (W0622), PSC heightened both “systematic” attribution and “importance to fix.”

Open responses revealed developers feel urgency and attention when PSC is high, and reassurance when PSC is low, with PSC acting as a heuristic for ambiguous or subtle smells otherwise easily overlooked. This suggests that PSC augments traditional smell labels, improving human judgment and triage efficacy.


Taken collectively, the Propensity Smelly Score is a precise, likelihood-based measure for analyzing, explaining, and mitigating code-smell risk in LLM-generated software. Its mathematical tractability, empirical robustness, utility in causal diagnostics, amenability to simple mitigation, and support for human interpretability underscore its significance for quality-aware assessment and deployment of generative models for code (Velasco et al., 19 Nov 2025).
