Codebook-Guided Empirical Prompt Selection
- The paper introduces CEPS, a novel framework that guides prompt selection using human-crafted codebooks and empirical F1 evaluation.
- It details a robust multi-phase human-in-the-loop validation process, enhancing transparency and reproducibility in LLM classifications.
- Empirical results demonstrate that CEPS outperforms traditional strategies by improving label alignment and minimizing biases in complex constructs.
Codebook-Guided Empirical Prompt Selection (CEPS) is a systematic methodology for optimizing prompt design and selection in LLM classification tasks, particularly where human–machine label alignment is critical and constructs possess precise, theory-driven definitions not reliably encoded in model pretraining. CEPS integrates codebook construction principles from qualitative research with empirical evaluation pipelines, aiming for transparency, objectivity, and replicability in prompt engineering while minimizing ad-hoc bias and irreproducibility (Anglin et al., 3 Dec 2025, Shah, 2024).
1. Formal Definition and Algorithmic Structure
CEPS reframes prompt selection as an empirical optimization over a discrete set of prompt candidates, each composed by systematically combining elements from a human-crafted codebook. Let $D_{\text{train}}$ be the set of labeled training texts, with gold-standard annotations. The prompt space $\mathcal{P}$ is generated by concatenating variants of construct definitions ($D$), task instructions ($T$), and inclusion/exclusion criteria ($C$).
Each prompt $p \in \mathcal{P}$ is evaluated by running zero-shot classification on $D_{\text{train}}$ and computing:
- $TP$: true positives
- $FP$: false positives
- $FN$: false negatives
- $TN$: true negatives
The empirical score for prompt $p$ is $S_p = F_1(p) = \frac{2\,TP}{2\,TP + FP + FN}$. The optimal prompt is selected via:
$$p^* = \arg\max_{p \in \mathcal{P}} S_p$$
Pseudocode for CEPS prompt search:
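```python
import random

def ceps_search(D, C, T, D_train, evaluate_prompt, N=50):
    """Randomly sample N codebook-derived prompts; return the best (prompt, F1) pair."""
    candidates = []
    for _ in range(N):
        d = random.choice(D)                        # one construct definition
        k = random.randint(0, len(C))               # number of criteria, 0..len(C)
        C_k = random.sample(C, k)                   # subset of inclusion/exclusion criteria
        t = random.choice(T)                        # one task instruction
        p_i = "\n".join([d, *C_k, t])               # assemble the prompt text
        TP, FP, FN = evaluate_prompt(p_i, D_train)  # caller-supplied zero-shot evaluation
        denom = 2 * TP + FP + FN
        S_p = 2 * TP / denom if denom else 0.0      # F1 score for this prompt
        candidates.append((p_i, S_p))
    return max(candidates, key=lambda x: x[1])      # highest-F1 candidate
```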
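The random search above samples from the prompt space. When the codebook is small, the same space can instead be enumerated exhaustively. The following is a minimal sketch under stated assumptions: `build_prompt_grid`, `f1_from_counts`, `select_best`, and `keyword_evaluator` are hypothetical names, and the evaluator is a keyword-matching stand-in for a real LLM call, used only so the example runs on its own.

```python
from itertools import combinations, product

def build_prompt_grid(definitions, criteria, instructions):
    """Enumerate every definition x criteria-subset x instruction combination."""
    subsets = [list(s) for k in range(len(criteria) + 1)
               for s in combinations(criteria, k)]
    return ["\n".join([d, *c, t])
            for d, c, t in product(definitions, subsets, instructions)]

def f1_from_counts(tp, fp, fn):
    """F1 from confusion counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def select_best(prompts, evaluate_prompt, train):
    """Score each prompt on the training texts and return the (prompt, F1) argmax."""
    scored = [(p, f1_from_counts(*evaluate_prompt(p, train))) for p in prompts]
    return max(scored, key=lambda x: x[1])

def keyword_evaluator(prompt, train):
    """Toy stand-in for an LLM (assumption): predicts positive when the text
    contains a cue word that also appears in the prompt."""
    cues = [w for w in ("thank", "grateful") if w in prompt.lower()]
    tp = fp = fn = 0
    for text, gold in train:
        pred = any(c in text.lower() for c in cues)
        tp += int(pred and gold)
        fp += int(pred and not gold)
        fn += int((not pred) and gold)
    return tp, fp, fn

# Illustrative codebook elements and labeled texts (invented for this example).
definitions = ["Gratitude: the text expresses thanks.",
               "Gratitude: the writer feels grateful or says thank you."]
criteria = ["Include indirect appreciation.", "Exclude sarcasm."]
instructions = ["Answer yes or no."]
train = [("Thank you so much!", True),
         ("I am grateful for you", True),
         ("Nice weather", False)]

prompts = build_prompt_grid(definitions, criteria, instructions)
best_prompt, best_f1 = select_best(prompts, keyword_evaluator, train)
print(len(prompts), round(best_f1, 2))
```

Exhaustive enumeration grows as the number of criteria subsets doubles with each added criterion, which is one reason random sampling of a fixed number of candidates may be preferable for larger codebooks.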
2. Multi-Phase Human-in-the-Loop Validation
CEPS can be contextualized within a broader codebook-guided empirical prompt selection methodology comprising four phases (Shah, 2024):
- Initial prompt candidate generation: Draft a baseline prompt and obtain initial model outputs on a pilot set.
- Response codebook construction/validation: Assemble ≥2 domain-trained assessors, apply an initial codebook, independently label LLM outputs, and iteratively refine definitions through deliberation until a reliability threshold is met (e.g., a minimum Cohen’s $\kappa$).
- Prompt codebook refinement and tuning: Using the validated response codebook, independently assess new model outputs, monitor inter-rater reliability and prompt effectiveness rate (PER), and revise prompt instructions as needed.
- Pipeline verification: Hold-out test evaluation with fresh assessors, monitoring if reliability and effectiveness thresholds persist.
Quantitative metrics include Cohen’s $\kappa$ for inter-rater agreement, Krippendorff’s $\alpha$ for multiple raters, and PER for output quality:
- Cohen’s $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
- PER
When either metric falls below threshold, further refinement and iteration are mandated.
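The reliability check can be sketched as follows. This is an illustrative implementation of the standard two-rater Cohen's kappa, not code from the cited work. The helper `needs_refinement` and its threshold arguments are hypothetical, and PER is passed in as an already-computed rate between 0 and 1 rather than derived here.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items on which the raters match.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters used one identical label
    return (p_o - p_e) / (1 - p_e)

def needs_refinement(kappa, per, kappa_min, per_min):
    """Hypothetical gate: iterate again if either metric is below its threshold."""
    return kappa < kappa_min or per < per_min

# Invented labels from two assessors on ten LLM outputs.
a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
b = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(a, b)
print(round(kappa, 2))
```

In this example the raters agree on 8 of 10 items ($p_o = 0.8$) and each uses both labels half the time ($p_e = 0.5$), so $\kappa = 0.3 / 0.5 = 0.6$.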
3. Critical Prompt Features Determining Performance
Empirical findings indicate the most influential prompt features in CEPS and related frameworks:
- Construct definition wording: Small phrasal variations can yield substantial shifts in LLM classification behavior (e.g., “generalized negative judgments...” vs. “broad negative beliefs…”).
- Task framing: Instruction format (e.g., binary response vs. class assignment) can change output reliability.
- Inclusion/exclusion criteria: Bullet-pointed edge case qualifiers systematically clarify class boundaries, improving inter-rater alignment.
Additive prompt elements—few-shot demonstrations, chain-of-thought reasoning, persona prefixes, and explanations—have less impact, though well-selected few-shot examples confer some robustness against poorly worded prompts (Anglin et al., 3 Dec 2025).
4. Empirical Benchmarking: Experimental Setup and Metrics
Experiments in psychology construct identification employed two main models: OpenAI GPT-4 (fixed temperature=0 for classification; temperature=1 for prompt generation) and Llama-3.3 (open-source, temperature=0). Datasets and splits:
| Construct | Dataset | N (total) | Split (train/dev/test) |
|---|---|---|---|
| Gratitude | GoEmotions subset | 600 | 25% / 50% / 25% |
| Negative Core Beliefs | Expressive-writing | 565 | 25% / 50% / 25% |
| Meaning Making | Expressive-writing | 589 | 25% / 50% / 25% |
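A 25% / 50% / 25% partition like the one in the table can be sketched as below. The function name, the fixed seed, the unstratified shuffle, and the floor-based rounding are all assumptions made for illustration; the table does not say how remainders or class balance were handled.

```python
import random

def split_25_50_25(items, seed=0):
    """Shuffle and split into train (25%), dev (50%), and test (remainder)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = n // 4   # floor of 25%
    n_dev = n // 2     # floor of 50%
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]      # whatever is left, about 25%
    return train, dev, test

# Sizes for the three dataset totals listed in the table.
sizes = {n: tuple(len(part) for part in split_25_50_25(range(n)))
         for n in (600, 565, 589)}
print(sizes)
```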
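Metrics (standard notation):
- Accuracy $= \frac{TP + TN}{TP + FP + FN + TN}$
- Precision $= \frac{TP}{TP + FP}$
- Recall $= \frac{TP}{TP + FN}$
- $F_1 = \frac{2\,TP}{2\,TP + FP + FN}$
Statistical uncertainty is estimated with a nonparametric bootstrap of 1,000 resamples and reported as confidence intervals (CIs). Significance of comparisons is marked with one, two, or three asterisks.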
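The metrics and their bootstrap intervals can be computed along these lines. This is a sketch rather than the evaluation code of the cited study: the function names are hypothetical, the interval is a simple percentile bootstrap, and the default `level=0.95` is an assumption, since the interval level is not legible in the text above.

```python
import random

def confusion(gold, pred):
    """Confusion counts for binary labels given as 0/1 or booleans."""
    pairs = [(bool(g), bool(p)) for g, p in zip(gold, pred)]
    tp = sum(g and p for g, p in pairs)
    fp = sum((not g) and p for g, p in pairs)
    fn = sum(g and (not p) for g, p in pairs)
    tn = sum((not g) and (not p) for g, p in pairs)
    return tp, fp, fn, tn

def metrics(gold, pred):
    """Accuracy, precision, recall, and F1; undefined ratios are reported as 0.0."""
    tp, fp, fn, tn = confusion(gold, pred)
    ratio = lambda num, den: num / den if den else 0.0
    return {
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }

def bootstrap_ci(gold, pred, metric, n_boot=1000, level=0.95, seed=0):
    """Percentile bootstrap CI; level=0.95 is an assumed default."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        stats.append(metrics([gold[i] for i in idx],
                             [pred[i] for i in idx])[metric])
    stats.sort()
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi

# Invented gold labels and predictions for eight texts.
gold = [1, 1, 1, 0, 0, 0, 1, 0]
pred = [1, 1, 0, 0, 0, 1, 1, 0]
point = metrics(gold, pred)
lo, hi = bootstrap_ci(gold, pred, "f1")
print(point, (lo, hi))
```

Resampling gold and predicted labels together, by item index, preserves the pairing between each text's annotation and its model output.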
5. Quantitative Results: CEPS Relative to Alternative Strategies
CEPS outperformed baseline, automatic prompt engineering, and additive strategies in aligning machine labels with human experts. The following table summarizes development-set $F_1$ scores (GPT-4) (Anglin et al., 3 Dec 2025):
| Construct | Bottom-ZS | Bottom-FS | CEPS-ZS | CEPS-FS |
|---|---|---|---|---|
| Gratitude | 0.80 | 0.89*** | 0.85 | 0.90 |
| Neg. Core Beliefs | 0.52 | 0.70*** | 0.70 | 0.73 |
| Meaning Making | 0.53 | 0.67* | 0.66 | 0.69 |
Automatic prompt engineering resulted in intermediate gains (e.g., Gratitude: ), but generally did not surpass codebook-guided selection. Persona and zero-shot chain-of-thought provided marginal, non-significant improvements. Few-shot chain-of-thought and explanations added minimal incremental benefit over ordinary few-shot examples.
Test-set performance for best CEPS+few-shot+auto-refinement prompt (GPT-4):
| Metric | Gratitude | Neg. Core Beliefs | Meaning Making |
|---|---|---|---|
| Accuracy | 0.89 [0.85–0.93] | 0.81 [0.75–0.87] | 0.92 [0.88–0.96] |
| Precision | 0.86 [0.78–0.94] | 0.74 [0.62–0.86] | 0.68 [0.50–0.86] |
| Recall | 0.92 [0.86–0.98] | 0.67 [0.55–0.79] | 0.88 [0.74–1.00] |
| $F_1$ | 0.89 [0.83–0.95] | 0.70 [0.60–0.80] | 0.76 [0.62–0.90] |
These results suggest that, among the strategies compared, CEPS yields the largest and most reliable improvements in human–machine label alignment for construct classification.
6. Practical Implementation Guidelines and Limitations
Practical deployment of CEPS comprises:
- Codebook-guided baseline: Draft 3–5 construct definition variants and 3–5 instruction variants, and list possible inclusion/exclusion bullet points; programmatically generate ~50 prompt combinations; empirically evaluate them and select $p^*$.
- Automatic refinement (optional): Seed with $p^*$; generate variants; keep the variant with maximal $F_1$.
- Few-shot selection (optional): Pool 50 candidate examples; sample 50 distinct sets; evaluate; select the highest-$F_1$ set.
- Additive techniques (persona, chain-of-thought, explanations): Lower priority unless surplus development resources are available; typically do not outperform optimal few-shot prompts.
- Final evaluation: Freeze prompt(s); evaluate on the held-out set with Accuracy, Precision, Recall, and $F_1$, plus CIs.
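The few-shot selection step in the list above might be sketched as follows. The names `select_few_shot_set` and `toy_score`, the set size `k=4`, and the fixed seed are assumptions: the guidelines give the pool size and the number of sets but not the number of examples per set. In practice `score_fn` would insert the examples into the selected prompt and return its development-set score; here a toy function stands in so the code runs without a model.

```python
import random

def select_few_shot_set(pool, score_fn, n_sets=50, k=4, seed=0):
    """Sample n_sets distinct k-example sets from the pool; return the best one."""
    rng = random.Random(seed)
    seen = set()
    best_set, best_score = None, float("-inf")
    attempts = 0
    # The attempt cap guards against pools too small to yield n_sets distinct sets.
    while len(seen) < n_sets and attempts < 100 * n_sets:
        attempts += 1
        idx = tuple(sorted(rng.sample(range(len(pool)), k)))
        if idx in seen:
            continue  # already evaluated this set (order is ignored)
        seen.add(idx)
        candidate = [pool[i] for i in idx]
        score = score_fn(candidate)
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set, best_score, len(seen)

def toy_score(examples):
    """Stand-in scorer (assumption): rewards label-balanced example sets."""
    positives = sum(label for _, label in examples)
    return 1 - abs(positives / len(examples) - 0.5)

# Invented pool of 50 labeled candidate examples.
pool = [(f"text {i}", i % 2 == 0) for i in range(50)]
best_set, best_score, n_evaluated = select_few_shot_set(pool, toy_score)
print(n_evaluated, best_score)
```

Sets are deduplicated by their sorted indices, so two samples containing the same examples in a different order count as one set. If example order is expected to matter for the model, the deduplication key would need to preserve it.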
Limitations include increased labor and expertise demands (estimated 2–3× naïve engineering cost), dependency on human assessors, and potential codebook overfitting to training data distributions (Shah, 2024). Regular re-validation is recommended as data and LLM versions change.
7. Objectivity, Replicability, and Recommendations
Integrating codebook-guided procedures systematically embeds human consensus, removes undocumented bias, and ensures transparent documentation of all deliberations and prompt revisions. Other researchers can reuse domain-specific codebooks, annotated prompt templates, and deliberation logs, which strengthens objectivity and replicability (Shah, 2024). Key recommendations include:
- Initiate with diverse pilot datasets for rapid codebook iteration.
- Explicitly define reliability and effectiveness thresholds.
- Share codebooks, prompt templates, and deliberation logs publicly.
- Periodically re-validate as domain distributions and model APIs evolve.
The plausible implication is that CEPS offers a robust paradigm for theory-driven, empirically-validated LLM prompt engineering in scientific classification tasks requiring strong human–machine alignment.