Propensity Smelly Score (PSC) in LLM Code
- PSC is a probabilistic metric that aggregates token probabilities over code spans to quantify the likelihood of LLM-generated code smells.
- It employs mean, median, and relative variants to provide interpretable signals and enable thresholding for practical smell mitigation.
- Empirical validation and causal analysis show that PSC enhances prompt-engineering strategies and improves human interpretability in software quality triage.
The Propensity Smelly Score (PSC) is a probabilistic metric designed to quantify the likelihood of specific code smell instances emerging in code generated by LLMs. PSC aggregates next-token probabilities over code spans detected by static analysis tools, providing an interpretable signal of model “confidence” in emitting smelly code snippets. It offers a precise, likelihood-based measure of smell propensity, is empirically validated for robustness and informativeness, and supports causal analysis of LLM generation factors. PSC further enables practical mitigation of smells through prompt-engineering and acts as a heuristic for human interpretability in software quality triage (Velasco et al., 19 Nov 2025).
1. Mathematical Definition of Propensity Smelly Score
Let $x = (x_1, \dots, x_T)$ denote the LLM-generated token sequence, and let a static analyzer (e.g., Pylint) associate a smell with the token span $S \subseteq \{1, \dots, T\}$ via an alignment function $\alpha$. Three PSC variants are formally defined:
- Mean PSC: $\mathrm{PSC}_{\text{mean}}(S) = \frac{1}{|S|} \sum_{i \in S} p(x_i \mid x_{<i})$
- Median PSC: $\mathrm{PSC}_{\text{med}}(S) = \operatorname{median}_{i \in S}\, p(x_i \mid x_{<i})$
- Relative PSC (rescaled to $[0, 1]$ using empirical minima and maxima $\mathrm{PSC}_{\min}$, $\mathrm{PSC}_{\max}$): $\mathrm{PSC}_{\text{rel}}(S) = \dfrac{\mathrm{PSC}(S) - \mathrm{PSC}_{\min}}{\mathrm{PSC}_{\max} - \mathrm{PSC}_{\min} + \epsilon}$
where $\epsilon > 0$ ensures numerical stability. A higher PSC value implies greater model propensity, or confidence, in generating the given smelly code snippet.
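The three variants can be sketched directly from their definitions; this is a minimal illustration over a list of span-aligned token probabilities (the probability values are made up for the example):

```python
import statistics

EPS = 1e-8  # small epsilon for numerical stability in the relative variant

def mean_psc(probs):
    """Mean PSC: average next-token probability over the smelly span."""
    return sum(probs) / len(probs)

def median_psc(probs):
    """Median PSC: robust to a few outlier tokens within the span."""
    return statistics.median(probs)

def relative_psc(psc, psc_min, psc_max, eps=EPS):
    """Relative PSC: min-max rescaling to [0, 1] via empirical extrema."""
    return (psc - psc_min) / (psc_max - psc_min + eps)

# Token probabilities aligned to one detected smell span (illustrative)
span_probs = [0.91, 0.40, 0.77, 0.85]
m = mean_psc(span_probs)      # 0.7325
med = median_psc(span_probs)  # 0.81
rel = relative_psc(med, psc_min=0.1, psc_max=0.95)
```

The median variant downweights isolated low-probability tokens inside an otherwise high-confidence span, which is one reason it is favored in practice.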
2. Estimation and Computation Procedure
PSC computation for an LLM-generated code sample involves the following steps:
- Token-Level Probabilities: During autoregressive decoding, token-level logits $z_t$ are converted to probabilities by $p(x_t \mid x_{<t}) = \operatorname{softmax}(z_t)_{x_t}$.
- Span Alignment: A static analyzer identifies a smell instance with a character range, mapped to token indices $S$ by the alignment function $\alpha$.
- Score Aggregation: Compute mean, median, or relative PSC over the tokens $i \in S$. Median and relative aggregations are most commonly used.
- Thresholding (for Interpretability): A smell may be labeled “likely” if $\mathrm{PSC} \geq \tau$, typically with $\tau = 0.5$. This threshold is applied post hoc, not as part of the score itself.
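The alignment and aggregation steps can be sketched as follows, assuming token offsets are available as (start, end) character ranges, as typical tokenizers expose; the function names are illustrative:

```python
def tokens_in_span(token_offsets, char_start, char_end):
    """Alignment: indices of tokens whose character range overlaps the smell span."""
    return [i for i, (s, e) in enumerate(token_offsets)
            if s < char_end and e > char_start]

def psc_for_smell(token_probs, token_offsets, char_start, char_end, tau=0.5):
    """Median PSC over the aligned span, plus a post-hoc 'likely smelly' flag."""
    idx = tokens_in_span(token_offsets, char_start, char_end)
    span = sorted(token_probs[i] for i in idx)
    n = len(span)
    med = span[n // 2] if n % 2 else (span[n // 2 - 1] + span[n // 2]) / 2
    return med, med >= tau

# Four tokens covering characters 0..15; the smell spans characters 3..12
offsets = [(0, 3), (3, 7), (7, 12), (12, 15)]
probs = [0.9, 0.4, 0.8, 0.7]
score, likely = psc_for_smell(probs, offsets, 3, 12)  # median of [0.4, 0.8]
```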
3. Empirical Validation and Robustness of PSC
Two complementary validations establish PSC’s utility as a structural code quality measure:
- Semantic-Preserving Transformations (SECT): Six statement-level edits (e.g., Add2Equal, SwitchRelation) are applied, and PSC is computed for original and mutated snippets. ANOVA on logit-transformed scores reveals that 31/41 analyzed smells (76%) exhibit statistical stability (no significant score shift) under syntax-preserving edits. Smells linked to naming/formatting (e.g., invalid-name) naturally show higher sensitivity.
- Information Gain (IG): Using a binary severity label (defined by token-level flag proportion), PSC demonstrates higher information gain than BLEU or CodeBLEU scores across all smell categories. This suggests PSC provides a more informative reduction of uncertainty regarding severe smells in generated code.
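The information-gain comparison can be illustrated with a toy computation over binarized PSC values; the data below are illustrative, not the study's:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy H(Y) in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(labels) - H(labels | feature) for a discretized feature."""
    n = len(labels)
    cond = 0.0
    for v in set(feature):
        sub = [y for x, y in zip(feature, labels) if x == v]
        cond += len(sub) / n * entropy(sub)
    return entropy(labels) - cond

# Binarize PSC at 0.5 and compare against a binary severity label
psc_bins = [int(p >= 0.5) for p in [0.8, 0.7, 0.3, 0.2, 0.9, 0.1]]
severe   = [1, 1, 0, 0, 1, 0]
ig = information_gain(psc_bins, severe)  # 1.0: the bins fully determine severity
```

A higher IG means that knowing the (discretized) score removes more uncertainty about severity, which is the sense in which PSC outperforms BLEU/CodeBLEU here.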
4. PSC as a Causal Instrument for Generation Analysis
PSC serves as the outcome variable in a structural causal model (SCM) designed to relate coding-pipeline variables to smell propensity, accounting for confounders $Z$ (LOC, AST size, token count, part-of-speech frequencies, etc.). In Pearl’s notation, the treatments $T$ are:
- $T_{\text{dec}}$ (decoding strategy: greedy, beam, top-k, top-p, contrastive, sampling)
- $T_{\text{size}}$ (model size: 0.5B, 1.5B, 3B, 7B)
- $T_{\text{arch}}$ (architecture: CodeLlama, Mistral, Qwen, StarCoder)
- $T_{\text{prompt}}$ (prompt formulation: minimal, generic “Complete code,” role-preamble, explicit smell-avoidance instructions)
The ATE (Average Treatment Effect) of an intervention $do(T = t)$ is:
$\mathrm{ATE}(t) = \mathbb{E}[\mathrm{PSC} \mid do(T = t)] - \mathbb{E}[\mathrm{PSC} \mid do(T = t_0)]$
where $t_0$ is the default setting. Estimates are obtained via stratification/regression over the confounders $Z$, utilizing the doWhy library. Random-confounder, placebo-treatment, simulated unmeasured-confounding, and subset-stability refutation tests confirm robustness.
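The stratification estimate can be sketched in plain Python (doWhy wraps this plus the refutation tests; the row layout `(treatment, stratum, psc)` is an assumption for illustration):

```python
from collections import defaultdict
from statistics import mean

def stratified_ate(rows, treatment, baseline):
    """ATE(t) = E[PSC | do(T=t)] - E[PSC | do(T=t0)], estimated as the
    size-weighted average of within-stratum mean differences.
    Each row is (treatment_value, confounder_stratum, psc)."""
    strata = defaultdict(lambda: {"t": [], "c": []})
    for t, s, y in rows:
        if t == treatment:
            strata[s]["t"].append(y)
        elif t == baseline:
            strata[s]["c"].append(y)
    total, ate = 0, 0.0
    for g in strata.values():
        if g["t"] and g["c"]:  # only strata with both arms contribute
            w = len(g["t"]) + len(g["c"])
            ate += w * (mean(g["t"]) - mean(g["c"]))
            total += w
    return ate / total

# Toy data: the treated prompt lowers PSC by 0.2 in every confounder stratum
rows = [("avoid", "small", 0.3), ("base", "small", 0.5),
        ("avoid", "large", 0.5), ("base", "large", 0.7)]
effect = stratified_ate(rows, "avoid", "base")  # -0.2
```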
5. Experimental Findings: Factors Influencing PSC
A tabular summary (Table 3 in the cited study) condenses factor impacts. Positive ATE denotes increased PSC (worse structural quality), negative ATE denotes decreased PSC (improved quality):
| Factor | Effect on PSC | Example Implications |
|---|---|---|
| Generation Strategy | Sampling (top-k, top-p, contrastive) reduces PSC for semantic/refactor smells but increases PSC for surface-level formatting smells | More randomness improves semantic quality but can worsen formatting |
| Model Size | Scaling 0.5B → 7B yields negligible ATE (\|ATE\| ≈ 0) across smell categories | Scaling alone does not mitigate smells |
| Architecture | Switching CodeLlama to Mistral/Qwen/StarCoder (same size) significantly lowers PSC (ATE up to −0.35) for warning/refactor smells | Model design and pretraining strongly influence smell rates |
| Prompt Formulation | Using an explicit smell-avoidance prompt yields large ATE reductions (≈ −0.25 to −0.40) for serious smells | Prompt-engineering is the most practical and effective means for mitigation |
This suggests prompt formulation is the strongest practical lever for reducing smell propensity without retraining, followed by architectural choices.
6. Mitigation Protocols Leveraging PSC
Supported by causal analysis, a prompt-engineering protocol utilizes PSC to selectively mitigate smell instances during inference. An outline:
- Start with base prompt and generate output.
- Use a static analyzer to detect smell spans.
- Compute PSC for each span. If PSC exceeds a threshold $\tau$ (e.g., $\tau = 0.5$), switch to a mitigation prompt:

```
You are an expert software engineer. Please complete the following code
without introducing any of these smells: {list all smell types}.
<smelly_function>
```

- Regenerate code as needed.
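The protocol above can be sketched as a regeneration loop; `generate`, `detect_smells`, and `psc_of` are caller-supplied hooks standing in for the LLM, the static analyzer, and PSC computation (hypothetical names, not the paper's implementation):

```python
def mitigate(prompt, generate, detect_smells, psc_of,
             smell_types, tau=0.5, max_rounds=3):
    """Regenerate until no detected smell span has PSC >= tau, or give up."""
    code = generate(prompt)
    for _ in range(max_rounds):
        risky = [s for s in detect_smells(code) if psc_of(code, s) >= tau]
        if not risky:
            return code
        # Switch to the smell-avoidance mitigation prompt and retry
        mitigation_prompt = (
            "You are an expert software engineer. Please complete the following "
            "code without introducing any of these smells: "
            + ", ".join(smell_types) + ".\n" + prompt
        )
        code = generate(mitigation_prompt)
    return code
```

Bounding the loop with `max_rounds` keeps the protocol cheap at inference time while still handling the high-propensity spans that threshold-based triage flags.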
Empirically, appending the “do not introduce these smells” instruction reduced median PSC as follows: broad-exception-raised W0719 from 0.80→0.67, missing-final-newline C0304 from 0.76→0.72, unused-import W0611 from 0.52→0.23. A plausible implication is that such prompt-level interventions are highly actionable for practitioners, especially for high-propensity smells.
7. Human Interpretability and User Study Insights
A between-subjects survey (n=36) compared the impact of PSC versus standard textual smell labels. Key findings:
- For unused-argument W0613, PSC increased perceived “importance to fix” and rater confidence.
- For unused-variable W0612, PSC elevated the sense of “systematic model error.”
- For redefined-builtin W0622, PSC heightened both “systematic” attribution and “importance to fix.”
Open responses revealed developers feel urgency and attention when PSC is high, and reassurance when PSC is low, with PSC acting as a heuristic for ambiguous or subtle smells otherwise easily overlooked. This suggests that PSC augments traditional smell labels, improving human judgment and triage efficacy.
Taken collectively, the Propensity Smelly Score is a fundamentally precise, likelihood-based measure for analyzing, explaining, and mitigating code smell risk in LLM-generated software. Its mathematical tractability, empirical robustness, utility in causal diagnostics, amenability to simple mitigation, and support for human interpretability highlight its significance in quality-aware assessment and deployment of generative models for code (Velasco et al., 19 Nov 2025).