- The paper demonstrates that multi-shot prompting significantly improved agreement for Claude Haiku, highlighting the importance of targeted prompt calibration.
- It employs repeated-run experiments with Cohen's kappa analysis to assess reliability, intra-model stability, and systematic categorical biases in qualitative coding.
- Findings indicate that while LLMs can augment human qualitative coding, persistent biases and output variance challenge full automation in software engineering research.
Prompt Engineering for LLM-Based Qualitative Coding of Psychological Safety in SE Communities
Introduction
This paper systematically evaluates prompt engineering strategies for using LLMs in qualitative coding of psychological safety statements within software engineering (SE) communities. It examines the empirical reliability and reproducibility of LLM-based closed coding under varying prompt designs, assessing three state-of-the-art conversational models: Claude Haiku, DeepSeek-Chat, and Gemini 2.5 Flash. The evaluation is grounded in human-annotated data from Stack Exchange and uses both zero-shot and multi-shot prompting. Through ten independent executions per configuration, the study quantifies coding agreement with humans, intra-model stability, and categorical prediction bias.
Experimental Protocol
The methodological design is driven by four central research questions: (1) assessment of LLM-human agreement, (2) analysis of prompt design effects, (3) scrutiny of intra-model output stability, and (4) evaluation of systematic categorical biases. The reference annotation corpus comprises 116 quotes from SE-focused Stack Exchange communities, manually coded into seven Edmondson behavioral categories. Prompts were tightly controlled, with zero-shot (no examples) and multi-shot (7 per-category examples) versions. Three LLMs were evaluated with temperature fixed at zero to minimize sampling noise, although the APIs did not guarantee fully deterministic output.
Agreement was measured using Cohen's kappa, with additional per-class F1 and explicit bias ratio analysis. Repeated runs enabled statistical significance testing (Wilcoxon signed-rank) and direct quantification of output variance. All methodological choices were made to align with reproducible, rigorous software engineering research conventions.
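As a reminder of what the agreement metric computes, the following is a minimal sketch of Cohen's kappa for two label sequences, written with the standard library only. The function name and the toy labels are illustrative assumptions, not the study's released code or data.

```python
from collections import Counter

def cohen_kappa(human, model):
    """Cohen's kappa between two equal-length sequences of category labels."""
    if len(human) != len(model) or not human:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: share of items on which both raters chose the same label.
    observed = sum(h == m for h, m in zip(human, model)) / n
    # Chance agreement: product of each rater's marginal label rates, summed over labels.
    human_freq = Counter(human)
    model_freq = Counter(model)
    expected = sum(human_freq[c] * model_freq[c] for c in human_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)

# Toy data (hypothetical): six quotes coded by a human and by a model.
human = ["EC", "EC", "DAE", "SNF", "EC", "DAE"]
model = ["EC", "SNF", "DAE", "SNF", "EC", "EC"]
kappa = cohen_kappa(human, model)
print(round(kappa, 3))
```

Because kappa subtracts chance agreement derived from each rater's label distribution, a model that over-uses one category is penalized even when its raw accuracy looks acceptable.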
Coding Agreement and Prompting Effects
Empirical agreement between LLM outputs and human annotation was consistently fair to moderate across the three models (kappa range: 0.33–0.44). Claude Haiku and Gemini 2.5 Flash showed notable sensitivity to prompt enrichment. For Claude Haiku, mean kappa improved by 0.034 with multi-shot prompting (p=0.004, Cohen's d=2.41; the large effect size reflects perfect directional consistency across runs), moving the model from fair to moderate agreement. DeepSeek-Chat was unaffected by the prompt engineering intervention (Δkappa = –0.001).
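To illustrate how a paired comparison of zero-shot and multi-shot runs can be tested, here is a rough sketch of an exact two-sided Wilcoxon signed-rank test together with a paired-differences Cohen's d. The summary does not state which test variant, zero-handling rule, or d definition the authors used, so those choices, the function names, and the kappa values below are all assumptions; the sketch does not attempt to reproduce the reported p=0.004 or d=2.41.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for small paired samples.
    Returns (W, p), where W is the smaller of the two signed-rank sums."""
    diffs = [b - a for a, b in zip(x, y) if b != a]  # zero differences are dropped
    n = len(diffs)
    if n == 0 or n > 20:
        raise ValueError("need between 1 and 20 non-zero differences")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis every sign pattern is equally likely: enumerate them.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        wp = sum(r for r, s in zip(ranks, signs) if s)
        if min(wp, total - wp) <= w + 1e-12:
            extreme += 1
    return w, extreme / 2 ** n

def paired_cohens_d(x, y):
    """Cohen's d for paired data: mean difference over the SD of the differences."""
    diffs = [b - a for a, b in zip(x, y)]
    return mean(diffs) / stdev(diffs)

# Hypothetical kappa values from ten runs per configuration (not the study's data).
zero_shot = [0.37, 0.39, 0.38, 0.40, 0.36, 0.38, 0.39, 0.37, 0.40, 0.38]
multi_shot = [0.39, 0.42, 0.42, 0.45, 0.385, 0.415, 0.435, 0.385, 0.41, 0.435]
w, p = wilcoxon_exact(zero_shot, multi_shot)
print(w, p, round(paired_cohens_d(zero_shot, multi_shot), 2))
```

When every run moves in the same direction, W is zero and the exact two-sided p-value reaches its minimum of 2 / 2^n for n non-zero pairs. The differences are also consistent relative to their mean, which is why a small absolute gain in kappa can still yield a large paired effect size.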
Per-class F1 analysis revealed strong dependency on frequency and semantic distinctiveness. "Disagreeing with Suggestions or Ideas" consistently exhibited the highest F1 (0.58–0.70); rare or semantically overlapping categories, such as "Sharing Negative Feedback" and "Admitting Mistakes," were persistently under-predicted, with F1 rarely exceeding 0.4. Micro-averaged metrics thus obscure critical deficiencies in minority/ambiguous class handling.
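The gap between aggregate and per-class metrics can be made concrete with a small sketch. It assumes single-label coding, in which case micro-averaged F1 equals plain accuracy; the function names and toy labels are illustrative and not taken from the study.

```python
def per_class_f1(gold, pred):
    """F1 score for each label appearing in either sequence."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def micro_f1(gold, pred):
    """For single-label classification, micro-averaged F1 reduces to accuracy."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)

# Toy data (hypothetical): the model never predicts the rare category.
gold = ["Expressing Concerns"] * 8 + ["Admitting Mistakes"] * 2
pred = ["Expressing Concerns"] * 10
scores = per_class_f1(gold, pred)
print(scores)
print(micro_f1(gold, pred), round(macro_f1(gold, pred), 3))
```

In this toy case the micro-averaged score stays at 0.8 although the rare category is never recovered, which is the kind of failure that only per-class reporting exposes.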
Model Stability and Multi-Run Necessity
Stability, measured as the standard deviation (SD) of kappa across runs, revealed meaningful inter-model and inter-configuration differences. Claude Haiku and DeepSeek-Chat exhibited low variance (SD of about 0.017), but Gemini 2.5 Flash displayed substantial instability (SD up to 0.038), with single-run values shifting between the fair and moderate agreement regimes. Multi-shot prompting further stabilized Claude Haiku output (SD dropped from 0.018 to 0.011) but did not materially affect DeepSeek-Chat or Gemini 2.5 Flash. These results undermine single-run evaluation practices and argue strongly for standardized multi-run protocols, since run-to-run output differences can dominate effect sizes in unstable deployments.
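A multi-run summary of this kind might be computed as sketched below. The interpretation bands follow the commonly used Landis and Koch thresholds (fair: 0.21 to 0.40, moderate: 0.41 to 0.60), which is an assumption about the scale behind the "fair" and "moderate" labels. The use of the sample SD, the function names, and the kappa values are likewise illustrative.

```python
from statistics import mean, stdev

def agreement_band(kappa):
    """Map a kappa value to a Landis-and-Koch-style verbal label (assumed scale)."""
    if kappa < 0:
        return "poor"
    for upper, name in ((0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")):
        if kappa <= upper:
            return name
    return "almost perfect"

def summarize_runs(kappas):
    """Mean, sample SD, and the set of verbal bands touched by individual runs."""
    return {
        "mean": mean(kappas),
        "sd": stdev(kappas),
        "bands": sorted({agreement_band(k) for k in kappas}),
    }

# Hypothetical per-run kappas for one model and prompt configuration.
runs = [0.36, 0.43, 0.39, 0.45, 0.38]
summary = summarize_runs(runs)
print(summary)
```

If the `bands` entry contains more than one label, a single run could have landed in either regime, so any conclusion drawn from one execution would depend on which run happened to be reported.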
Categorical Bias in Qualitative Coding
Bias analysis exposes pronounced and systematic errors in LLM class distributions. All models severely over-predict "Sharing Negative Feedback" (by up to 5.25 times the gold standard rate) while consistently under-predicting "Expressing Concerns," the majority class. Multi-shot prompting mitigates but does not eliminate these biases. The observed pattern (SNF and DAE over-predicted; EC under-predicted, where SNF is "Sharing Negative Feedback" and EC is "Expressing Concerns") persists across models and prompt configurations, suggesting strong model-level semantic priors rather than mere prompt artifacts. Furthermore, these overlap-driven confusions reveal model limitations in disambiguating critical social phenomena, which is particularly concerning for applied SE research.
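One plausible reading of the bias ratio is the number of times a model predicts a category divided by the number of times that category appears in the gold standard. The sketch below implements that reading; the exact definition used in the study is not given here, so the formula, the function name, and the toy counts are assumptions.

```python
from collections import Counter

def bias_ratios(gold, pred):
    """Predicted-to-gold frequency ratio for every category present in the gold labels.
    Values above 1 indicate over-prediction, values below 1 under-prediction."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must have equal length")
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    # Equal-length sequences make the count ratio identical to the rate ratio.
    return {label: pred_counts[label] / count for label, count in gold_counts.items()}

# Toy data (hypothetical): EC is the majority class in the gold standard.
gold = ["EC"] * 6 + ["SNF"] * 2 + ["DAE"] * 2
pred = ["EC"] * 3 + ["SNF"] * 5 + ["DAE"] * 2
ratios = bias_ratios(gold, pred)
print(ratios)
```

A ratio only describes the marginal distribution: a category can have a ratio of exactly 1 while every individual prediction for it is wrong, so it complements rather than replaces per-class F1.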
Methodological Implications and Guidelines
The results yield clear, actionable recommendations for empirical qualitative coding workflows employing LLMs:
- Model-specific prompt calibration is essential: Benefits from multi-shot prompting are non-universal; empirical sensitivity analysis by target LLM is required.
- Multi-run evaluation must become standard practice: Single-run metrics, especially in models with high variance (e.g., Gemini 2.5 Flash), are fundamentally misleading.
- Category-level bias analysis is critical, particularly for datasets with imbalanced or semantically related classes. Relying on aggregate agreement hides distributional failures.
- Rare category performance metrics must be foregrounded: Averaged or overall scores can conceal model collapse on classes of high substantive interest.
Given persistent model-level categorical biases, LLMs are not yet suitable for fully autonomous qualitative psychological safety coding. However, with careful prompt design, repeated-run aggregation, and explicit interpretive bias monitoring, they can augment, but not replace, rigorous human analysis.
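Repeated-run aggregation can take several forms, and the summary does not say which one the authors have in mind. A simple option, sketched here purely as an assumption, is a per-quote majority vote that also reports the vote share and leaves ties unresolved for human review. All names and labels are hypothetical.

```python
from collections import Counter

def majority_vote(runs):
    """Aggregate labels across runs, quote by quote.

    `runs` is a list of label lists, one per run, aligned by quote index.
    Returns one (label, vote_share) pair per quote; label is None on a tie."""
    aggregated = []
    for labels in zip(*runs):
        counts = Counter(labels)
        top_label, top_count = counts.most_common(1)[0]
        tied = [label for label, count in counts.items() if count == top_count]
        winner = top_label if len(tied) == 1 else None  # ties go to a human coder
        aggregated.append((winner, top_count / len(labels)))
    return aggregated

# Toy data (hypothetical): four runs over three quotes.
runs = [
    ["EC", "SNF", "DAE"],
    ["EC", "EC", "DAE"],
    ["EC", "SNF", "SNF"],
    ["SNF", "EC", "DAE"],
]
result = majority_vote(runs)
print(result)
```

Voting reduces run-to-run noise but cannot correct a bias that every run shares, so the vote share is best treated as a signal for routing low-consensus quotes to human coders rather than as a measure of correctness.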
Theoretical and Practical Implications
From a theoretical standpoint, the findings clarify that LLMs' reproducibility and categorical alignment with human qualitative reasoning remain limited in complex, socially nuanced coding tasks. The non-uniform impact of prompt engineering across models indicates that prompt effectiveness is highly model-dependent—a clear direction for further investigation.
Practically, these results reinforce that LLMs should be deployed as scaffolding in human-centric workflows, not as autonomous analytic agents. The detailed bias analysis warns against mechanistic adoption of LLM labels in empirical SE research, as systematic misclassification of critical behavioral categories could compromise substantive findings and intervention design.
Future Directions
The study motivates several research trajectories:
- Integrating chain-of-thought or contrastive prompting to further probe semantic categorization limits.
- Increasing sample diversity to test robustness across SE subcommunities, industrial or cross-cultural settings, and alternative codebooks.
- Systematic inter-LLM agreement mapping to explore convergence properties beyond the gold standard.
- Developing mitigation strategies for model-level semantic priors, particularly those fueling recurring categorical biases.
Conclusion
This controlled empirical assessment demonstrates that contemporary LLMs, under both baseline and enriched prompt regimes, can partially replicate human qualitative coding of psychological safety but are constrained by non-trivial agreement ceilings, high variance (for some models), and pronounced categorical bias. Prompt engineering yields significant gains for select models (notably Claude Haiku) but not universally. These findings mandate more nuanced, contextually aware deployment and evaluation of LLM-driven qualitative coding in SE research, and necessitate further methodological refinement before full automation is credible.
For full reproducibility, all experimental artifacts, prompts, and code are publicly released alongside this study (2605.07422).