Codebook-Guided Empirical Prompt Selection

Updated 10 December 2025
  • The paper introduces CEPS, a novel framework that guides prompt selection using human-crafted codebooks and empirical F1 evaluation.
  • It details a robust multi-phase human-in-the-loop validation process, enhancing transparency and reproducibility in LLM classifications.
  • Empirical results demonstrate that CEPS outperforms traditional strategies by improving label alignment and minimizing biases in complex constructs.

Codebook-Guided Empirical Prompt Selection (CEPS) is a systematic methodology for optimizing prompt design and selection in LLM classification tasks, particularly where human–machine label alignment is critical and constructs possess precise, theory-driven definitions not reliably encoded in model pretraining. CEPS integrates codebook construction principles from qualitative research with empirical evaluation pipelines, aiming for transparency, objectivity, and replicability in prompt engineering while minimizing ad-hoc bias and irreproducibility (Anglin et al., 3 Dec 2025, Shah, 2024).

1. Formal Definition and Algorithmic Structure

CEPS reframes prompt selection as an empirical optimization over a discrete set of prompt candidates, each composed by systematically combining elements from a human-crafted codebook. Let $D_\text{train}$ be the set of labeled training texts, with gold-standard annotations. The prompt space $P = \{p_1, \ldots, p_N\}$ is generated by concatenating variants of construct definitions, task instructions, and inclusion/exclusion criteria.

Each prompt $p \in P$ is evaluated by running zero-shot classification on $D_\text{train}$ and computing:

  • $TP(p)$: true positives
  • $FP(p)$: false positives
  • $FN(p)$: false negatives
  • $TN(p)$: true negatives

The empirical score for prompt $p$ is $S(p) \equiv F_1(p) = \frac{2\,TP(p)}{2\,TP(p) + FP(p) + FN(p)}$. The optimal prompt is selected via:

$$p^* = \arg\max_{p \in P} S(p)$$

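A Python sketch of the CEPS prompt search follows. Here `D`, `C`, and `T` are lists holding the codebook's construct-definition variants, inclusion/exclusion criteria, and task-instruction variants, and `evaluate_prompt` is a caller-supplied function returning `(TP, FP, FN)`. The `seed` argument and the default `N=50` are illustrative assumptions.

```python
import random

def ceps_search(D, C, T, D_train, evaluate_prompt, N=50, seed=0):
    """Score N randomly assembled codebook prompts; return the best (prompt, F1) pair."""
    rng = random.Random(seed)                  # seeded so the search is reproducible
    L = []
    for _ in range(N):
        d = rng.choice(D)                      # one construct-definition variant
        k = rng.randint(0, len(C))             # how many criteria to include (0..|C|)
        C_k = rng.sample(C, k)                 # random subset of inclusion/exclusion criteria
        t = rng.choice(T)                      # one task-instruction variant
        p_i = "\n".join([d, *C_k, t])          # assemble the candidate prompt
        TP, FP, FN = evaluate_prompt(p_i, D_train)  # zero-shot run scored against gold labels
        denom = 2 * TP + FP + FN
        S_p = 2 * TP / denom if denom else 0.0      # F1, defined as 0 when TP = FP = FN = 0
        L.append((p_i, S_p))
    return max(L, key=lambda x: x[1])          # p* = argmax over recorded scores
```

Because every candidate prompt and its score are recorded, the search is both empirical and transparent: prompt changes and their evaluation results are documented.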
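
The search above relies on an `evaluate_prompt` helper that the source does not define. A minimal sketch, assuming binary labels (1 = construct present) and a hypothetical `classify(prompt, text)` callable that wraps whatever LLM client is in use, could count the confusion-matrix cells as follows. The `keyword_classifier` and `toy_train` data are toy stand-ins, not part of CEPS.

```python
from functools import partial
from typing import Callable, Sequence, Tuple

Example = Tuple[str, int]  # (text, gold label), with 1 = construct present

def evaluate_prompt(
    prompt: str,
    d_train: Sequence[Example],
    classify: Callable[[str, str], int],
) -> Tuple[int, int, int]:
    """Count TP, FP, FN for one prompt over the labeled training texts."""
    tp = fp = fn = 0
    for text, gold in d_train:
        pred = classify(prompt, text)  # hypothetical LLM wrapper returning 0 or 1
        if pred == 1 and gold == 1:
            tp += 1
        elif pred == 1 and gold == 0:
            fp += 1
        elif pred == 0 and gold == 1:
            fn += 1
    return tp, fp, fn

def keyword_classifier(prompt: str, text: str) -> int:
    """Toy stand-in for an LLM call: flags any text containing 'thank'."""
    return int("thank" in text.lower())

toy_train = [
    ("Thank you so much for the help", 1),
    ("I am grateful for my family", 1),
    ("No thanks, not interested", 0),
    ("The weather is cold today", 0),
]

counts = evaluate_prompt("Does the text express gratitude?", toy_train, keyword_classifier)

# Bind the classifier so the helper takes just (prompt, d_train), as the search loop expects.
two_arg_evaluator = partial(evaluate_prompt, classify=keyword_classifier)
```

Binding the classifier with `functools.partial` keeps the search loop independent of any particular model client, so the same loop can be reused for different models.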

2. Multi-Phase Human-in-the-Loop Validation

CEPS can be situated within a broader codebook-guided, human-in-the-loop prompt validation methodology comprising four phases (Shah, 2024):

  1. Initial prompt candidate generation: Draft a baseline prompt and obtain initial model outputs on a pilot set.
  2. Response codebook construction/validation: Assemble ≥2 domain-trained assessors, apply an initial codebook, independently label LLM outputs, and iteratively refine definitions through deliberation to meet a reliability threshold (e.g., Cohen’s $\kappa \geq 0.70$).
  3. Prompt codebook refinement and tuning: Using the validated response codebook, independently assess new model outputs, monitor inter-rater reliability and prompt effectiveness rate (PER), and revise prompt instructions as needed.
  4. Pipeline verification: Hold-out test evaluation with fresh assessors, monitoring if reliability and effectiveness thresholds persist.

Quantitative metrics include Cohen’s $\kappa$ for inter-rater agreement, Krippendorff’s $\alpha$ for multiple raters, and PER for output quality:

  • Cohen’s $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
  • PER $= \frac{\text{Outputs meeting all criteria}}{\text{Total outputs}} \times 100\%$

When either metric falls below threshold, further refinement and iteration are mandated.
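
The two formulas and the threshold check can be sketched in a few lines of Python. This is an illustrative implementation for two raters only. The 0.70 default for $\kappa$ mirrors the example threshold above, whereas the PER threshold of 80% is a placeholder assumption, since the source does not state one. The function names are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence

def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)

def prompt_effectiveness_rate(meets_all_criteria: Sequence[bool]) -> float:
    """PER: percentage of outputs that satisfy every codebook criterion."""
    return 100.0 * sum(meets_all_criteria) / len(meets_all_criteria)

def needs_refinement(kappa: float, per: float,
                     kappa_min: float = 0.70, per_min: float = 80.0) -> bool:
    """True when either metric is below its threshold (per_min is an assumed value)."""
    return kappa < kappa_min or per < per_min

labels_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
labels_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1]
kappa = cohens_kappa(labels_a, labels_b)           # p_o = 0.8, p_e = 0.5
per = prompt_effectiveness_rate([True] * 9 + [False])
iterate_again = needs_refinement(kappa, per)
```

In this toy case the raters agree on 8 of 10 items, but half of that agreement is expected by chance, so $\kappa$ falls below 0.70 and another refinement round would be triggered even though PER is high.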

3. Critical Prompt Features Determining Performance

Empirical findings indicate the most influential prompt features in CEPS and related frameworks:

  • Construct definition wording: Small phrasal variations can yield substantial shifts in LLM classification behavior (e.g., “generalized negative judgments...” vs. “broad negative beliefs…”).
  • Task framing: Instruction format (e.g., binary response vs. class assignment) can change output reliability.
  • Inclusion/exclusion criteria: Bullet-pointed edge case qualifiers systematically clarify class boundaries, improving inter-rater alignment.

Additive prompt elements—few-shot demonstrations, chain-of-thought reasoning, persona prefixes, and explanations—have less impact, though well-selected few-shot examples confer some robustness against poorly worded prompts (Anglin et al., 3 Dec 2025).

4. Empirical Benchmarking: Experimental Setup and Metrics

Experiments in psychology construct identification employed two main models: OpenAI GPT-4 (fixed temperature=0 for classification; temperature=1 for prompt generation) and Llama-3.3 (open-source, temperature=0). Datasets and splits:

| Construct | Dataset | N (total) | Split (train/dev/test) |
| --- | --- | --- | --- |
| Gratitude | GoEmotions subset | 600 | 25% / 50% / 25% |
| Negative Core Beliefs | Expressive-writing | 565 | 25% / 50% / 25% |
| Meaning Making | Expressive-writing | 589 | 25% / 50% / 25% |
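
A reproducible 25% / 50% / 25% partition of this kind can be sketched as below. The source does not say whether the splits were stratified or how they were seeded, so the plain seeded shuffle and the `split_dataset` name are assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")

def split_dataset(
    items: Sequence[T],
    fractions: Tuple[float, float, float] = (0.25, 0.50, 0.25),
    seed: int = 0,
) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_dev = round(fractions[1] * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # the remainder absorbs any rounding
    return train, dev, test

# 600 items, matching the size of the Gratitude dataset in the table.
train, dev, test = split_dataset(range(600))
```

Assigning the remainder to the test split guarantees that every item lands in exactly one partition even when the fractions do not divide the dataset size evenly.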

Metrics (standard notation):

  • Accuracy $= \frac{TP + TN}{TP + FP + TN + FN}$
  • Precision $= \frac{TP}{TP + FP}$
  • Recall $= \frac{TP}{TP + FN}$
  • $F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2\,TP}{2\,TP + FP + FN}$

Statistical uncertainty is estimated with a 1,000-resample nonparametric bootstrap (95% CIs). Significance of comparisons is annotated as * $p < 0.10$, ** $p < 0.05$, *** $p < 0.01$.
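
As an illustration of how such an interval can be obtained, the sketch below resamples (gold, predicted) label pairs with replacement and reads the 95% bounds off the sorted $F_1$ values. The percentile method is an assumption, since the source specifies only a nonparametric bootstrap. The toy labels are fabricated for the example and are not study data.

```python
import random
from typing import Sequence, Tuple

def f1_score(gold: Sequence[int], pred: Sequence[int]) -> float:
    """F1 = 2TP / (2TP + FP + FN) for binary labels (1 = positive)."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(
    gold: Sequence[int],
    pred: Sequence[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for F1, resampling items with replacement."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one bootstrap resample
        stats.append(f1_score([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper

# Toy labels: 40 positives (32 found, 8 missed) and 60 negatives (6 false alarms).
gold = [1] * 40 + [0] * 60
pred = [1] * 32 + [0] * 8 + [1] * 6 + [0] * 54
point = f1_score(gold, pred)   # 64 / 78
lo, hi = bootstrap_ci(gold, pred)
```

Resampling whole items, rather than resampling the confusion-matrix counts separately, keeps each gold label paired with its prediction.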

5. Quantitative Results: CEPS Relative to Alternative Strategies

CEPS outperformed baseline, automatic prompt engineering, and additive strategies in aligning machine labels with human experts. The following table summarizes development-set $F_1$ scores (GPT-4) (Anglin et al., 3 Dec 2025):

| Construct | Bottom-ZS | Bottom-FS | CEPS-ZS | CEPS-FS |
| --- | --- | --- | --- | --- |
| Gratitude | 0.80 | 0.89*** | 0.85 | 0.90 |
| Neg. Core Beliefs | 0.52 | 0.70*** | 0.70 | 0.73 |
| Meaning Making | 0.53 | 0.67* | 0.66 | 0.69 |

Automatic prompt engineering resulted in intermediate gains (e.g., Gratitude: $F_1 = 0.88^{**}$), but generally did not surpass codebook-guided selection. Persona and zero-shot chain-of-thought provided marginal, non-significant improvements. Few-shot chain-of-thought and explanations added minimal incremental benefit over ordinary few-shot examples.

Test-set performance for best CEPS+few-shot+auto-refinement prompt (GPT-4):

| Metric | Gratitude | Neg. Core Beliefs | Meaning Making |
| --- | --- | --- | --- |
| Accuracy | 0.89 [0.85–0.93] | 0.81 [0.75–0.87] | 0.92 [0.88–0.96] |
| Precision | 0.86 [0.78–0.94] | 0.74 [0.62–0.86] | 0.68 [0.50–0.86] |
| Recall | 0.92 [0.86–0.98] | 0.67 [0.55–0.79] | 0.88 [0.74–1.00] |
| $F_1$ | 0.89 [0.83–0.95] | 0.70 [0.60–0.80] | 0.76 [0.62–0.90] |

This suggests CEPS yields the largest and most reliable improvements in construct classification alignment.

6. Practical Implementation Guidelines and Limitations

Practical deployment of CEPS comprises:

  1. Codebook-guided baseline: Draft 3–5 construct definition variants and 3–5 instruction variants, and list possible inclusion/exclusion bullet points; programmatically generate ~50 prompt combinations; empirically evaluate them and select $p^*$.
  2. Automatic refinement (optional): Seed with $p^*$; generate variants; select the variant with maximal $F_1$.
  3. Few-shot selection (optional): Pool 50 candidate examples; sample 50 distinct sets; evaluate each; select the set with the highest $F_1$.
  4. Additive techniques (persona, chain-of-thought, explanations): Lower priority unless surplus development resources are available; typically do not outperform optimal few-shot prompts.
  5. Final evaluation: Freeze prompt(s); evaluate on the held-out set with Accuracy, Precision, Recall, and $F_1$, plus 95% CIs.
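
Step 1 can be sketched as follows: enumerate the full grid of definition, criteria-subset, and instruction combinations, then draw roughly 50 distinct prompts from it. Sampling uniformly over the grid is one possible design and is an assumption here, as are the `build_prompt_candidates` name and the placeholder codebook strings.

```python
import itertools
import random
from typing import List, Sequence

def build_prompt_candidates(
    definitions: Sequence[str],
    instructions: Sequence[str],
    criteria: Sequence[str],
    n_prompts: int = 50,
    seed: int = 0,
) -> List[str]:
    """Sample distinct prompts from the definition x criteria-subset x instruction grid."""
    rng = random.Random(seed)
    # Every subset of the criteria, from the empty set up to the full list.
    subsets = [
        combo
        for k in range(len(criteria) + 1)
        for combo in itertools.combinations(criteria, k)
    ]
    grid = list(itertools.product(definitions, subsets, instructions))
    sampled = rng.sample(grid, min(n_prompts, len(grid)))  # without replacement
    return ["\n".join([d, *c, t]) for d, c, t in sampled]

# Placeholder codebook elements (hypothetical wording, for illustration only).
definitions = [f"Definition variant {i}" for i in range(1, 4)]
instructions = [f"Instruction variant {i}" for i in range(1, 4)]
criteria = [f"- Criterion {i}" for i in range(1, 5)]

# 3 definitions x 16 criteria subsets x 3 instructions = 144 possible prompts.
candidates = build_prompt_candidates(definitions, instructions, criteria)
```

Sampling without replacement avoids spending evaluation budget on duplicate prompts, which matters when each candidate requires a full pass over the training texts.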

Limitations include increased labor and expertise demands (estimated 2–3× naïve engineering cost), dependency on human assessors, and potential codebook overfitting to training data distributions (Shah, 2024). Regular re-validation is recommended as data and LLM versions change.

7. Objectivity, Replicability, and Recommendations

Integrating codebook-guided procedures systematically embeds human consensus, removes undocumented bias, and ensures transparent documentation of all deliberations and prompt revisions. Other researchers can reuse domain-specific codebooks, annotated prompt templates, and deliberation logs, which strengthens objectivity and replicability (Shah, 2024). Key recommendations include:

  • Initiate with diverse pilot datasets for rapid codebook iteration.
  • Explicitly define reliability and effectiveness thresholds.
  • Share codebooks, prompt templates, and deliberation logs publicly.
  • Periodically re-validate as domain distributions and model APIs evolve.

The plausible implication is that CEPS offers a robust paradigm for theory-driven, empirically-validated LLM prompt engineering in scientific classification tasks requiring strong human–machine alignment.
