PromptSELight: Evaluating Code Prompt Sensitivity
- PromptSELight is a binary pass-rate protocol that measures LLM prompt sensitivity through variations in pass rate across semantically identical prompts that differ in emotional tone and personality framing.
- It employs controlled perturbations with emotion and personality templates to ensure semantic invariance while rigorously testing model stability.
- The protocol’s metrics, including elasticity and AUC-E, enable practitioners to assess performance-stability tradeoffs for reliable code generation in production settings.
PromptSELight is a binary pass rate–based protocol for quantifying prompt sensitivity in code generation LLMs under emotion- and personality-driven variations. It was introduced as a component of the PromptSE (Prompt Sensitivity Evaluation) framework, enabling rapid measurement of output stability when probability scores (logits) are unavailable, especially in closed-source or production model APIs (Ma et al., 17 Sep 2025).
1. Definition and Concept
PromptSELight evaluates the stability of LLM-generated code by measuring the variation in Pass@k (binary correctness rate) across sets of semantically equivalent prompts that differ systematically in emotional tone and personality profile. Given an original prompt $p$, PromptSELight uses template-based rewriting (covering valence-arousal emotional dimensions and multiple personality facets, such as technical orientation and experience level) to produce variants that are functionally identical to $p$ but stylistically diverse.
The sensitivity score is calculated as the absolute difference between the pass rate for the original prompt and the average pass rate of its variants. Unlike probability-aware continuous evaluation (PromptSE), PromptSELight is limited to pass/fail execution results.
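A minimal Python sketch of this computation follows; the function names and input layout (per-prompt lists of boolean test outcomes) are illustrative assumptions rather than the paper's reference implementation.

```python
from typing import Sequence

def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of generated samples that pass all test cases (binary pass rate)."""
    return sum(results) / len(results) if results else 0.0

def sensitivity_score(original_results: Sequence[bool],
                      variant_results: Sequence[Sequence[bool]]) -> float:
    """|pass rate of the original prompt - mean pass rate over its variants|."""
    p_original = pass_rate(original_results)
    p_variants = sum(pass_rate(r) for r in variant_results) / len(variant_results)
    return abs(p_original - p_variants)

# Hypothetical example: the original prompt passes 8/10 runs;
# three variants pass 7/10, 6/10, and 8/10 runs.
score = sensitivity_score([True] * 8 + [False] * 2,
                          [[True] * 7 + [False] * 3,
                           [True] * 6 + [False] * 4,
                           [True] * 8 + [False] * 2])
print(f"sensitivity = {score:.3f}")  # |0.8 - 0.7| = 0.100
```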
2. Methodology of Prompt Variant Generation
PromptSELight uses controlled prompt perturbations at three distance levels ($d = 1, 2, 3$), reflecting the degree of stylistic (emotional and personality) rewriting. These are constructed from:
- Eight emotion templates derived from affective computing theory (e.g., “excited”, “frustrated”).
- Personality axes (e.g., “novice” vs “expert”, “collaborative” vs “critical”).
- A rewriting engine that enforces semantic invariance by preserving all signature and interface constraints.
This ensures differences in output arise strictly from the model’s prompt sensitivity—not from altered input semantics or programming requirements.
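A simplified sketch of this rewriting step is shown below; the emotion and persona templates here are hypothetical stand-ins for the paper's affective-computing and personality templates, and serve only to illustrate how stylistic framing is varied while the task text and signature are kept verbatim.

```python
# Illustrative templates only -- the actual PromptSE templates are derived from
# affective-computing theory and richer personality profiles.
EMOTION_TEMPLATES = {
    "excited": "I'm really excited about this task! {task}",
    "frustrated": "I've been stuck on this for hours and I'm frustrated. {task}",
}
PERSONA_TEMPLATES = {
    "novice": "I'm new to programming, so please keep it simple. {task}",
    "expert": "As an experienced engineer, I want a clean solution. {task}",
}

def generate_variants(task_description: str, signature: str) -> list[str]:
    """Wrap the unchanged task and function signature in stylistic framings.

    The signature and functional requirements are appended verbatim so that
    semantic invariance is preserved: only tone and persona change.
    """
    variants = []
    for template in list(EMOTION_TEMPLATES.values()) + list(PERSONA_TEMPLATES.values()):
        variants.append(template.format(task=task_description) + "\n\n" + signature)
    return variants

variants = generate_variants(
    "Write a function that returns the n-th Fibonacci number.",
    "def fib(n: int) -> int:",
)
```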
3. Evaluation Metrics and Aggregation
The primary metrics employed are:
- Binary Pass Rate: The fraction of generated code samples from each prompt (original and variant) that pass predefined test cases.
- Elasticity: For each perturbation distance $d$, the per-prompt elasticity is computed as
$$E_p(d) = 1 - \left| P_o(p) - \bar{P}_v(p, d) \right|,$$
where $P_o(p)$ is the pass rate of the original prompt $p$ and $\bar{P}_v(p, d)$ is the mean pass rate of its distance-$d$ variants.
- AUC-E (Area Under Curve of Elasticity): Quantifies overall stability across perturbation strengths by integrating the elasticity curve over distance, approximated with Simpson’s rule:
$$\text{AUC-E} = \frac{1}{d_{\max} - d_{\min}} \int_{d_{\min}}^{d_{\max}} \bar{E}(d)\, \mathrm{d}d,$$
where $\bar{E}(d)$ is the dataset-level mean elasticity at distance $d$. The normalization by the distance range keeps AUC-E in $[0, 1]$.
These mechanisms standardize sensitivity comparison across models and prompt perturbation depths.
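The aggregation can be sketched in a few lines of Python, assuming the per-prompt elasticity definition given above and SciPy's Simpson-rule integrator; the pass-rate values in the usage example are hypothetical.

```python
import numpy as np
from scipy.integrate import simpson  # Simpson's rule for the elasticity curve

def elasticity(p_original: float, mean_variant_pass_rate: float) -> float:
    """E(d) = 1 - |P_o - mean variant pass rate at distance d|, as defined above."""
    return 1.0 - abs(p_original - mean_variant_pass_rate)

def auc_e(p_original: float, variant_pass_rates_by_distance: dict[int, float]) -> float:
    """Normalized area under the elasticity curve across perturbation distances."""
    distances = np.array(sorted(variant_pass_rates_by_distance))
    curve = np.array([elasticity(p_original, variant_pass_rates_by_distance[d])
                      for d in distances])
    area = simpson(curve, x=distances)
    return float(area / (distances[-1] - distances[0]))  # normalize to [0, 1]

# Hypothetical pass rates: original prompt 0.80; variants degrade with distance.
print(auc_e(0.80, {1: 0.75, 2: 0.70, 3: 0.65}))  # ≈ 0.9
```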
4. Empirical Findings Across Model Families
The PromptSE and PromptSELight protocols were applied to 14 models from three families (Llama, Qwen, DeepSeek). Notable findings include:
- Performance (e.g., Pass@1) and stability (AUC-E) are largely independent, with only a weak Spearman correlation between the two, demonstrating that high performance does not guarantee high prompt stability.
- Some smaller models (e.g., Qwen-1.5B) exhibit unexpectedly high stability as measured by AUC-E compared to much larger counterparts, indicating that prompt stability is not a simple function of model scale.
- Models fall into four quadrants (high/low performance versus high/low stability), underscoring the need for explicit prompt robustness objectives.
These results highlight the value of binary pass rate–based sensitivity analysis for real-world model selection.
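The independence check and quadrant view can be reproduced from per-model (Pass@1, AUC-E) pairs, for example with a Spearman rank correlation and a median split; the scores below are placeholders rather than the paper's reported numbers, and the median-split thresholds are an assumed choice.

```python
import statistics
from scipy.stats import spearmanr

# Placeholder (model, Pass@1, AUC-E) triples -- not the paper's reported results.
scores = [("model-A", 0.82, 0.55), ("model-B", 0.41, 0.91),
          ("model-C", 0.78, 0.88), ("model-D", 0.39, 0.52)]

pass_at_1 = [s[1] for s in scores]
stability = [s[2] for s in scores]
rho, p_value = spearmanr(pass_at_1, stability)  # near zero when the two are independent

# Quadrant assignment relative to the median of each axis (assumed thresholds).
perf_cut, stab_cut = statistics.median(pass_at_1), statistics.median(stability)
for name, perf, stab in scores:
    quadrant = (("high" if perf >= perf_cut else "low") + "-performance / "
                + ("high" if stab >= stab_cut else "low") + "-stability")
    print(name, quadrant)
```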
5. Practical Applications and Significance
PromptSELight supports both rapid screening (especially on closed-source APIs without access to token-level logits) and robust ranking of models for deployment scenarios where consistency is critical:
- Enables practitioners to quantify and compare model robustness to real-world emotional and stylistic prompt variations.
- Allows direct measurement of performance–stability tradeoffs for code LLMs.
- Facilitates deployment of AI-assisted software development tools by identifying models that are least sensitive to prompt variation, which is critical for reducing unpredictable behavior in production.
It positions prompt stability alongside accuracy and fairness as a key model selection criterion.
6. Mathematical Formulation
The underlying formulas of PromptSELight as documented include:
- Binary pass rate sensitivity for an original prompt $p$ with variant set $V(p)$:
$$S(p) = \left| P_o(p) - \frac{1}{|V(p)|} \sum_{p' \in V(p)} P(p') \right|,$$
where $P(\cdot)$ denotes the binary pass rate.
- Elasticity aggregation over the evaluation dataset $\mathcal{D}$ at perturbation distance $d$:
$$\bar{E}(d) = \frac{1}{|\mathcal{D}|} \sum_{p \in \mathcal{D}} E_p(d), \qquad E_p(d) = 1 - \left| P_o(p) - \bar{P}_v(p, d) \right|.$$
- Area Under Curve of Elasticity (AUC-E) for cross-model comparison:
$$\text{AUC-E} = \frac{1}{d_{\max} - d_{\min}} \int_{d_{\min}}^{d_{\max}} \bar{E}(d)\, \mathrm{d}d,$$
approximated with Simpson’s rule over the evaluated distances.
These quantifications are consistent across the binary (PromptSELight) and continuous (PromptSE) variants, allowing direct benchmarking and practical operational integration.
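As a worked illustration with hypothetical pass rates: if $P_o = 0.80$ and the mean variant pass rates at distances $d = 1, 2, 3$ are $0.75$, $0.70$, and $0.65$, the elasticities are $E(1) = 0.95$, $E(2) = 0.90$, and $E(3) = 0.85$; Simpson’s rule gives $\int_1^3 \bar{E}(d)\,\mathrm{d}d \approx \tfrac{1}{3}(0.95 + 4 \cdot 0.90 + 0.85) = 1.80$, and dividing by the distance range of $2$ yields $\text{AUC-E} = 0.90$.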
7. Implications and Future Research
PromptSELight demonstrates that prompt robustness must be measured explicitly and cannot be inferred from performance or model size alone. Its binary sensitivity metric allows scalable deployment and practical evaluation in closed settings. Future work may involve refining the rewriting templates, extending the methodology to additional domains (beyond code), and integrating stability assessment into continuous model monitoring pipelines.
By systematically exposing and quantifying sensitivity to realistic communication-style variations, PromptSELight advances the field of trustworthy, stable AI-assisted programming and model evaluation for industrial and research settings (Ma et al., 17 Sep 2025).