
Prompt Stability in Code LLMs: Measuring Sensitivity across Emotion- and Personality-Driven Variations (2509.13680v1)

Published 17 Sep 2025 in cs.SE and cs.AI

Abstract: Code generation models are widely used in software development, yet their sensitivity to prompt phrasing remains under-examined. Identical requirements expressed with different emotions or communication styles can yield divergent outputs, while most benchmarks emphasize only peak performance. We present PromptSE (Prompt Sensitivity Evaluation), a framework that creates semantically equivalent prompt variants with emotion and personality templates, and that evaluates stability using probability aware continuous scoring or using binary pass rates when logits are unavailable. The results are aggregated into a proposed area under curve metric (AUC-E) for cross model comparison. Across 14 models from three families (Llama, Qwen, and DeepSeek), our study shows that performance and stability behave as largely decoupled optimization objectives, and it reveals architectural and scale related patterns that challenge common assumptions about model robustness. The framework supports rapid screening for closed-source models as well as detailed stability analysis in research settings. PromptSE enables practitioners to quantify performance stability trade offs for deployment and model selection, positioning prompt stability as a complementary evaluation dimension alongside performance and fairness, and contributing to more trustworthy AI-assisted software development tools.

Summary

  • The paper introduces PromptSE, a framework that quantifies prompt stability in code LLMs by evaluating sensitivity to affect-based variations.
  • It employs dual-pathway evaluation with probability-aware SoftExec and binary pass rates to produce AUC-E scores, revealing a decoupling between performance and stability.
  • The study finds that model scale and architecture impact prompt stability non-monotonically, offering practical insights for robust model selection in software engineering.

Prompt Stability in Code LLMs: A Systematic Evaluation of Sensitivity to Emotion- and Personality-Driven Variations

Introduction

This paper addresses a critical but underexplored aspect of code generation with LLMs: the stability of model outputs under semantically equivalent prompt variations, particularly those reflecting different emotional states and personality traits. The authors introduce PromptSE, a principled framework for quantifying prompt sensitivity, and propose AUC-E, a standardized metric for cross-model stability comparison. The paper evaluates 14 models across three major architecture families (Llama, Qwen, DeepSeek) using a large-scale, template-driven prompt variant dataset, revealing that performance and stability are largely decoupled and that model scale does not monotonically improve robustness.

Methodological Framework

The core methodological innovation is the PromptSE framework, which operationalizes prompt stability as a measurable property. The approach consists of three main components:

  1. Emotion- and Personality-Aware Prompt Variant Generation: The authors design a template system grounded in affective computing and personality psychology, generating prompt variants that systematically vary in emotional tone (e.g., focused, excited, frustrated) and personality profile (technical orientation, experience level, collaboration style). Perturbation strength is controlled via a distance parameter $d \in \{0.1, 0.2, 0.3\}$, ensuring semantic and interface invariance.
  2. Dual-Pathway Sensitivity Evaluation:
    • PromptSE (probability-aware): Utilizes SoftExec, a continuous correctness metric that weights outputs by model confidence, requiring access to output probabilities.
    • PromptSELight (probability-free): Relies on binary pass rates, enabling evaluation of closed-source models or rapid screening.
  3. Elasticity and AUC-E Computation: Elasticity quantifies the stability of model outputs across prompt variants, and AUC-E aggregates elasticity across perturbation distances into a single, interpretable [0,1] stability score (see the sketch following this list).
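
As a rough illustration of component 3, the sketch below computes an elasticity score per perturbation distance and aggregates the scores into an AUC-E value via trapezoidal integration. The drop-based elasticity definition and the normalization by the distance span are assumptions for illustration only; the paper's exact formulas for elasticity and SoftExec are not reproduced here.

```python
import numpy as np

def elasticity(baseline_score: float, variant_scores: list[float]) -> float:
    """Stability at one perturbation distance: 1 minus the mean relative
    performance drop of the prompt variants against the baseline prompt.
    (Assumed definition for illustration; the paper's formula may differ.)"""
    base = max(baseline_score, 1e-9)
    drops = [max(base - s, 0.0) / base for s in variant_scores]
    return 1.0 - float(np.mean(drops))

def auc_e(distances: list[float], elasticities: list[float]) -> float:
    """Aggregate elasticity across perturbation distances into one [0, 1]
    score: trapezoidal area under the elasticity-vs-distance curve,
    normalized by the distance span."""
    d = np.asarray(distances, dtype=float)
    e = np.asarray(elasticities, dtype=float)
    area = float(np.sum(0.5 * (e[1:] + e[:-1]) * np.diff(d)))
    return area / float(d[-1] - d[0])

# Example with the paper's perturbation distances d in {0.1, 0.2, 0.3}
# (scores are hypothetical placeholder values)
scores = {0.1: [0.81, 0.79, 0.83], 0.2: [0.77, 0.74, 0.80], 0.3: [0.70, 0.72, 0.66]}
baseline = 0.84
es = [elasticity(baseline, scores[d]) for d in (0.1, 0.2, 0.3)]
print(round(auc_e([0.1, 0.2, 0.3], es), 3))
```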

Experimental Design

The evaluation uses the HumanEval Python benchmark, augmented with 14,760 prompt variants generated via the template system. Each prompt variant is tested with 16 independent model outputs, and all models are evaluated under identical decoding policies and test suites. The model set spans a range of parameter scales and training paradigms, including instruction tuning and distillation.
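
The paper does not spell out its pass-rate estimator; with n independent samples per prompt, a common choice is the unbiased pass@k estimator of Chen et al. (2021), sketched below under that assumption. For k = 1 it reduces to the empirical pass rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate given n generations, c of which pass the
    test suite. For k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g., 16 independent outputs per prompt variant, 11 of which pass
print(pass_at_k(n=16, c=11, k=1))  # 0.6875
```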

Results

Decoupling of Performance and Stability

The analysis demonstrates that performance (Pass@1) and stability (AUC-E) show a negative but statistically non-significant correlation (Spearman $\rho = -0.433$, $p = 0.122$). Models are distributed across all four quadrants of the performance-stability space, indicating that high performance does not guarantee robustness and vice versa.

Figure 1: Relationship between performance and stability, showing the distribution of models across the performance-stability space.
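
The decoupling claim amounts to a rank correlation over the 14 per-model (Pass@1, AUC-E) pairs; a minimal sketch with SciPy is shown below, using made-up placeholder numbers rather than the paper's results.

```python
from scipy.stats import spearmanr

# Hypothetical per-model scores (the paper evaluates 14 models; these
# values are placeholders, not the reported data).
pass_at_1 = [0.31, 0.45, 0.52, 0.58, 0.61, 0.66, 0.70, 0.72,
             0.74, 0.77, 0.80, 0.83, 0.85, 0.88]
auc_e_scores = [0.65, 0.52, 0.60, 0.47, 0.55, 0.43, 0.58, 0.40,
                0.50, 0.46, 0.38, 0.49, 0.42, 0.44]

rho, p_value = spearmanr(pass_at_1, auc_e_scores)
print(f"Spearman rho = {rho:.3f}, p = {p_value:.3f}")
```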

Scale and Architecture Effects

Prompt stability does not monotonically improve with model scale. Notably, the Qwen-1.5B model (Tiny group) achieves the highest AUC-E ($0.646$), outperforming larger models in stability. Family-specific patterns are observed: Llama models exhibit a balance between performance and stability, while Qwen models tend toward higher performance but lower robustness.

Figure 2: Overall sensitivity by size group ($|\Delta\text{Pass}|$), illustrating non-monotonic scale effects and family-specific stability patterns.

Sensitivity to Prompt Perturbation

Elasticity curves and SoftExec-based analyses reveal that most models maintain high stability under light to moderate prompt perturbations, but model-specific vulnerabilities emerge at larger perturbation distances. The average $|\Delta\text{Pass}|$ remains within a narrow range ($0.078$–$0.082$) across distances, yet significant main effects of scale and architecture are detected (Kruskal–Wallis $H = 49.663$, $p < 0.001$).
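
For reference, a Kruskal–Wallis H-test of $|\Delta\text{Pass}|$ across size groups can be run as sketched below; the group labels and values are illustrative placeholders, not the paper's data.

```python
from scipy.stats import kruskal

# Hypothetical |ΔPass| samples grouped by model size (placeholder values).
tiny   = [0.05, 0.06, 0.07, 0.05, 0.08]
medium = [0.08, 0.09, 0.07, 0.10, 0.08]
large  = [0.07, 0.11, 0.09, 0.10, 0.12]

h_stat, p_value = kruskal(tiny, medium, large)
print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.4f}")
```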

PromptSELight as a Practical Proxy

PromptSELight provides a strong approximation to PromptSE at the model level (Pearson $r = 0.717$, Spearman $\rho = 0.723$), enabling rapid, resource-efficient stability screening for closed-source models. However, some models exhibit rank drift due to differences in confidence calibration.

Figure 3: Model-level AUC-E, PromptSELight vs. PromptSE (RQ3), demonstrating strong monotonic consistency between the two evaluation methods.

Emotional and Personality Effects

Analysis of valence and arousal dimensions shows that emotional coloring in prompts can induce confidence miscalibration and subtle correctness shifts, with effects varying by model family. High-arousal, negative-valence prompts are particularly diagnostic of model brittleness. Calibration metrics (ECE) and confidence bias analyses further differentiate models with similar correctness but divergent confidence behaviors.
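
Expected calibration error (ECE) measures the gap between a model's stated confidence and its empirical accuracy; a minimal binned version is sketched below, assuming equal-width confidence bins (the paper's exact binning scheme is not specified here).

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned ECE: weighted average absolute gap between mean confidence
    and empirical accuracy within equal-width confidence bins."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # Last bin is closed on the right so confidence 1.0 is included.
        mask = (conf >= lo) & (conf < hi) if i < n_bins - 1 else (conf >= lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - corr[mask].mean())
    return float(ece)

# Example: per-sample confidences and pass/fail outcomes (illustrative only)
print(round(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 0]), 3))
```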

Implications

Practical Deployment

The findings have direct implications for model selection and deployment in software engineering contexts. The decoupling of performance and stability enables practitioners to select models tailored to specific reliability requirements. The dual-pathway evaluation supports both detailed research analysis and rapid industrial screening.

Theoretical Insights

The non-monotonic relationship between scale and stability suggests that prompt robustness is not an emergent property of larger models and may require explicit optimization. The systematic impact of emotional and personality-driven prompt variations highlights the need for robustness evaluation beyond traditional functional correctness metrics.

Future Directions

Potential avenues for future research include:

  • Investigating architectural and training factors that contribute to prompt stability.
  • Extending the framework to other programming languages and task domains.
  • Developing adaptive prompting strategies and robustness-aware training objectives.

Conclusion

This work establishes prompt stability as a distinct, quantifiable dimension for evaluating code LLMs, complementing traditional performance metrics. The PromptSE framework and AUC-E metric provide a unified methodology for cross-model comparison, revealing that performance and stability are decoupled and that smaller models can exhibit superior robustness. The dual-pathway evaluation enables both comprehensive research analysis and practical deployment screening. These insights position prompt stability alongside performance, fairness, and safety as a core consideration in the development and deployment of trustworthy AI-assisted software engineering tools.
