Codebook-Guided Empirical Prompt Selection

Updated 10 December 2025

The paper introduces CEPS, a novel framework that guides prompt selection using human-crafted codebooks and empirical F1 evaluation.
It details a robust multi-phase human-in-the-loop validation process, enhancing transparency and reproducibility in LLM classifications.
Empirical results demonstrate that CEPS outperforms traditional strategies by improving label alignment and minimizing biases in complex constructs.

Codebook-Guided Empirical Prompt Selection (CEPS) is a systematic methodology for optimizing prompt design and selection in LLM classification tasks, particularly where human–machine label alignment is critical and constructs possess precise, theory-driven definitions not reliably encoded in model pretraining. CEPS integrates codebook construction principles from qualitative research with empirical evaluation pipelines, aiming for transparency, objectivity, and replicability in prompt engineering while minimizing ad-hoc bias and irreproducibility (Anglin et al., 3 Dec 2025, Shah, 2024).

1. Formal Definition and Algorithmic Structure

CEPS reframes prompt selection as an empirical optimization over a discrete set of prompt candidates, each composed by systematically combining elements from a human-crafted codebook. Let $D_\text{train}$ be the set of labeled training texts, with gold-standard annotations. The prompt space $P = \{p_1, \ldots, p_N\}$ is generated by concatenating variants of construct definitions, task instructions, and inclusion/exclusion criteria.
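As a rough illustration of how such a prompt space could be assembled, the sketch below enumerates every combination of one definition, any subset of criteria, and one instruction. The function name `build_prompt_space` and the example codebook strings are hypothetical and are not taken from the cited papers.

```python
from itertools import chain, combinations, product

def build_prompt_space(definitions, instructions, criteria):
    """Enumerate prompts: one definition + any subset of criteria + one instruction."""
    # All subsets of the criteria list, from the empty set to the full set.
    subsets = list(chain.from_iterable(
        combinations(criteria, k) for k in range(len(criteria) + 1)
    ))
    prompts = []
    for d, subset, t in product(definitions, subsets, instructions):
        prompts.append("\n".join([d, *subset, t]))
    return prompts

# Hypothetical codebook fragments, for illustration only.
definitions = ["Gratitude: thankfulness toward a person or event.",
               "Gratitude: appreciation for a benefit received."]
instructions = ["Answer 1 if the text expresses the construct, else 0."]
criteria = ["Include: explicit thanks.", "Exclude: sarcasm."]

P = build_prompt_space(definitions, instructions, criteria)
print(len(P))  # 2 definitions x 4 criteria subsets x 1 instruction = 8
```

Full enumeration grows exponentially in the number of criteria, which is one reason to sample a fixed number of candidates rather than enumerate the whole space.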

Each prompt $p \in P$ is evaluated by running zero-shot classification on $D_\text{train}$ and computing:

$TP(p)$ : true positives
$FP(p)$ : false positives
$FN(p)$ : false negatives
$TN(p)$ : true negatives

The empirical score for prompt $p$ is $S(p) \equiv F_1(p) = \frac{2 TP(p)}{2 TP(p) + FP(p) + FN(p)}$. The optimal prompt is selected via:

$p^* = \arg\max_{p \in P} S(p)$

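A Python sketch of the CEPS prompt search, where D, C, and T are lists of construct definition variants, inclusion/exclusion criteria, and task instructions, and evaluate_prompt stands for running zero-shot classification on $D_\text{train}$:

import random

def ceps_search(D, C, T, D_train, evaluate_prompt, N):
    scored = []  # (prompt, F1) pairs, kept for documentation
    for _ in range(N):
        d = random.choice(D)                # one construct definition
        k = random.randint(0, len(C))       # number of criteria to include (0..len(C))
        C_k = random.sample(C, k)           # random subset of k criteria
        t = random.choice(T)                # one task instruction
        p_i = "\n".join([d, *C_k, t])       # assemble the candidate prompt
        TP, FP, FN = evaluate_prompt(p_i, D_train)
        denom = 2 * TP + FP + FN
        S_p = 2 * TP / denom if denom else 0.0  # F1, set to 0 when undefined
        scored.append((p_i, S_p))
    return max(scored, key=lambda x: x[1])  # (p_star, S(p_star))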

Because every candidate prompt is stored alongside its score, the search is both empirical and transparent: each prompt variation and its evaluation is documented.
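The evaluation step of the search reduces to counting confusion-matrix cells over the labeled training texts. The following is a minimal sketch, assuming binary 0/1 labels. The names `confusion_counts`, `classify`, and `keyword_classifier` are hypothetical, and `classify` stands in for a zero-shot LLM call that a real pipeline would supply.

```python
def confusion_counts(prompt, labeled_texts, classify):
    """Count TP, FP, FN for one prompt on (text, gold_label) pairs.

    `classify` is a hypothetical callable (prompt, text) -> 0 or 1 that
    stands in for a zero-shot LLM call; it is not a real library API.
    """
    tp = fp = fn = 0
    for text, gold in labeled_texts:
        pred = classify(prompt, text)
        if pred == 1 and gold == 1:
            tp += 1
        elif pred == 1 and gold == 0:
            fp += 1
        elif pred == 0 and gold == 1:
            fn += 1
    return tp, fp, fn

# Toy stand-in classifier: predicts 1 when the text contains "thank".
def keyword_classifier(prompt, text):
    return int("thank" in text.lower())

data = [("Thank you so much!", 1), ("Thanks for nothing.", 0),
        ("I appreciate your help.", 1), ("It rained today.", 0)]
tp, fp, fn = confusion_counts("Does the text express gratitude?", data,
                              keyword_classifier)
score = 2 * tp / (2 * tp + fp + fn)  # F1 as defined above
print((tp, fp, fn), score)
```

The toy data is chosen so that the keyword rule makes one false positive (sarcasm) and one false negative (gratitude without the word "thank"), the kind of boundary case that inclusion/exclusion criteria are meant to address.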

2. Multi-Phase Human-in-the-Loop Validation

CEPS can be situated within a broader human-in-the-loop prompt development methodology comprising four phases (Shah, 2024):

Initial prompt candidate generation: Draft a baseline prompt and obtain initial model outputs on a pilot set.
Response codebook construction/validation: Assemble ≥2 domain-trained assessors, apply an initial codebook, independently label LLM outputs, and iteratively refine definitions through deliberation to meet a reliability threshold (e.g., Cohen’s $\kappa \geq 0.70$).
Prompt codebook refinement and tuning: Using the validated response codebook, independently assess new model outputs, monitor inter-rater reliability and prompt effectiveness rate (PER), and revise prompt instructions as needed.
Pipeline verification: Hold-out test evaluation with fresh assessors, monitoring if reliability and effectiveness thresholds persist.

Quantitative metrics include Cohen’s $\kappa$ for inter-rater agreement, Krippendorff’s $\alpha$ for multiple raters, and PER for output quality:

Cohen’s $\kappa = \frac{p_o - p_e}{1 - p_e}$, where $p_o$ is the observed agreement and $p_e$ is the agreement expected by chance.
PER $= \frac{\text{Outputs meeting all criteria}}{\text{Total outputs}} \times 100\%$

When either metric falls below threshold, further refinement and iteration are mandated.
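The two formulas above can be computed directly from assessor labels. The sketch below is illustrative only: the function names, the toy labels, and the 80% PER threshold are assumptions, while the $\kappa \geq 0.70$ threshold follows the example given in the text.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # degenerate case (single shared label); this convention is an assumption
    return (p_o - p_e) / (1 - p_e)

def prompt_effectiveness_rate(meets_all_criteria):
    """PER as a percentage: share of outputs meeting every codebook criterion."""
    return 100.0 * sum(meets_all_criteria) / len(meets_all_criteria)

# Toy labels from two hypothetical assessors.
a = ["ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok", "ok"]
b = ["ok", "ok", "bad", "ok", "ok", "ok", "ok", "bad", "ok", "bad"]
kappa = cohens_kappa(a, b)
per = prompt_effectiveness_rate([True, True, False, True])

# Refinement is triggered when either metric is below its threshold.
needs_refinement = kappa < 0.70 or per < 80.0  # 80% PER threshold is an assumed example
print(round(kappa, 3), per, needs_refinement)
```

Here the raters agree on 8 of 10 items, yet chance agreement is 0.58, so $\kappa$ is well below 0.70 and another round of deliberation would be called for.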

3. Critical Prompt Features Determining Performance

Empirical findings indicate the most influential prompt features in CEPS and related frameworks:

Construct definition wording: Small phrasal variations can yield substantial shifts in LLM classification behavior (e.g., “generalized negative judgments...” vs. “broad negative beliefs…”).
Task framing: Instruction format (e.g., binary response vs. class assignment) can change output reliability.
Inclusion/exclusion criteria: Bullet-pointed edge case qualifiers systematically clarify class boundaries, improving inter-rater alignment.

Additive prompt elements—few-shot demonstrations, chain-of-thought reasoning, persona prefixes, and explanations—have less impact, though well-selected few-shot examples confer some robustness against poorly worded prompts (Anglin et al., 3 Dec 2025).

4. Empirical Benchmarking: Experimental Setup and Metrics

Experiments in psychology construct identification employed two main models: OpenAI GPT-4 (fixed temperature=0 for classification; temperature=1 for prompt generation) and Llama-3.3 (open-source, temperature=0). Datasets and splits:

Construct	Dataset	N (total)	Split (train/dev/test)
Gratitude	GoEmotions subset	600	25% / 50% / 25%
Negative Core Beliefs	Expressive-writing	565	25% / 50% / 25%
Meaning Making	Expressive-writing	589	25% / 50% / 25%
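A 25% / 50% / 25% split of this kind can be produced with a seeded shuffle. This is a minimal sketch under the assumption of a simple random split; the source does not state whether the original splits were random or stratified, and `split_dataset` is a hypothetical helper.

```python
import random

def split_dataset(items, fractions=(0.25, 0.50, 0.25), seed=0):
    """Shuffle and split items into train/dev/test by the given fractions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_dev = round(fractions[1] * n)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # remainder absorbs any rounding
    return train, dev, test

train, dev, test = split_dataset(range(600))
print(len(train), len(dev), len(test))  # 150 300 150
```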

Metrics (standard notation):

Accuracy $= \frac{TP + TN}{TP + FP + TN + FN}$
Precision $= \frac{TP}{TP + FP}$
Recall $= \frac{TP}{TP + FN}$
$F_1 = \frac{2\cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN}$
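The four metrics follow directly from the confusion-matrix counts. The sketch below uses made-up counts and a hypothetical function name; returning 0.0 for an undefined ratio is an assumed convention.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

m = classification_metrics(tp=30, fp=20, tn=40, fn=10)
print(m)

# The two forms of F1 given above agree: harmonic mean vs. count-based.
harmonic = 2 * m["precision"] * m["recall"] / (m["precision"] + m["recall"])
print(abs(harmonic - m["f1"]) < 1e-12)
```

With these counts, accuracy is 0.70, precision 0.60, recall 0.75, and $F_1 = 60/90 \approx 0.667$. Note that $F_1$ ignores true negatives, so it is less flattered than accuracy when the positive class is rare.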

Statistical uncertainty was estimated with a nonparametric bootstrap of 1,000 resamples, reported as $95\%$ confidence intervals. Significance of comparisons is annotated as $*\;p<0.10$, $**\;p<0.05$, $***\;p<0.01$.
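A nonparametric bootstrap interval for $F_1$ can be sketched as follows. The percentile method used here is an assumption, since the source does not say which bootstrap interval variant was applied; the function names and toy labels are hypothetical.

```python
import random

def f1_score(gold, pred):
    """F1 for binary 0/1 labels, computed from confusion counts."""
    tp = sum(g == 1 and p == 1 for g, p in zip(gold, pred))
    fp = sum(g == 0 and p == 1 for g, p in zip(gold, pred))
    fn = sum(g == 1 and p == 0 for g, p in zip(gold, pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def bootstrap_ci(gold, pred, metric=f1_score, n_resamples=1000,
                 alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample items with replacement, recompute metric."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample of item indices
        stats.append(metric([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    lo = stats[int(round((alpha / 2) * (n_resamples - 1)))]
    hi = stats[int(round((1 - alpha / 2) * (n_resamples - 1)))]
    return lo, hi

# Toy data: 30 positives (24 found), 30 negatives (6 false alarms).
gold = [1] * 30 + [0] * 30
pred = [1] * 24 + [0] * 6 + [1] * 6 + [0] * 24
point = f1_score(gold, pred)
lo, hi = bootstrap_ci(gold, pred)
print(f"F1 = {point:.2f} [{lo:.2f}, {hi:.2f}]")
```

Gold labels and predictions are resampled together by index, so each resample preserves the pairing between a text's human code and its machine code.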

5. Quantitative Results: CEPS Relative to Alternative Strategies

CEPS outperformed baseline, automatic prompt engineering, and additive strategies in aligning machine labels with human experts. The following table summarizes development-set $F_1$ scores (GPT-4) (Anglin et al., 3 Dec 2025):

Construct	Bottom-ZS	Bottom-FS	CEPS-ZS	CEPS-FS
Gratitude	0.80	0.89***	0.85	0.90
Neg. Core Beliefs	0.52	0.70***	0.70	0.73
Meaning Making	0.53	0.67*	0.66	0.69

Automatic prompt engineering resulted in intermediate gains (e.g., Gratitude: $F_1=0.88^{**}$), but generally did not surpass codebook-guided selection. Persona and zero-shot chain-of-thought provided marginal, non-significant improvements. Few-shot chain-of-thought and explanations added minimal incremental benefit over ordinary few-shot examples.

Test-set performance for best CEPS+few-shot+auto-refinement prompt (GPT-4):

Metric	Gratitude	Neg. Core Beliefs	Meaning Making
Accuracy	0.89 [0.85–0.93]	0.81 [0.75–0.87]	0.92 [0.88–0.96]
Precision	0.86 [0.78–0.94]	0.74 [0.62–0.86]	0.68 [0.50–0.86]
Recall	0.92 [0.86–0.98]	0.67 [0.55–0.79]	0.88 [0.74–1.00]
$F_1$	0.89 [0.83–0.95]	0.70 [0.60–0.80]	0.76 [0.62–0.90]

Taken together, these results suggest that, among the strategies compared, CEPS yields the largest and most consistent improvements in human–machine label alignment for construct classification.

6. Practical Implementation Guidelines and Limitations

Practical deployment of CEPS comprises:

Codebook-guided baseline: Draft 3–5 construct definition variants and 3–5 instruction variants, and list possible inclusion/exclusion bullet points; programmatically generate ~50 prompt combinations; evaluate each empirically and select $p^*$.
Automatic refinement (optional): Seed with $p^*$; generate variants; select the variant with maximal $F_1$.
Few-shot selection (optional): Pool 50 candidate examples; sample 50 distinct sets; evaluate each; select the set with the highest $F_1$.
Additive techniques (persona, chain-of-thought, explanations): Lower priority unless surplus development resources are available; typically do not outperform optimal few-shot prompts.
Final evaluation: Freeze prompt(s); evaluate on held-out set with Accuracy, Precision, Recall, $F_1$ plus $95\%$ CIs.
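The few-shot selection step in the list above can be sketched as sampling distinct example sets from a pool and keeping the best-scoring one. Everything named here is hypothetical: `select_few_shot_set`, the `score_fn` callable (which in practice would run the frozen prompt plus the examples and return an $F_1$), and the set size of 4, which the source does not specify.

```python
import random
from math import comb

def select_few_shot_set(pool, score_fn, n_sets=50, set_size=4, seed=0):
    """Sample distinct few-shot sets from a pool and keep the best-scoring one.

    `score_fn` is a hypothetical callable mapping a tuple of examples to an
    F1 score; `set_size` is an assumed value, not taken from the source.
    """
    if n_sets > comb(len(pool), set_size):
        raise ValueError("not enough distinct sets available in the pool")
    rng = random.Random(seed)
    seen, best_set, best_score = set(), None, float("-inf")
    while len(seen) < n_sets:
        idx = tuple(sorted(rng.sample(range(len(pool)), set_size)))
        if idx in seen:
            continue  # enforce distinct sets
        seen.add(idx)
        candidate = tuple(pool[i] for i in idx)
        score = score_fn(candidate)
        if score > best_score:
            best_set, best_score = candidate, score
    return best_set, best_score

# Toy pool and scorer: the score is the share of even-numbered examples.
pool = [f"example_{i}" for i in range(50)]

def toy_score(examples):
    return sum(int(e.split("_")[1]) % 2 == 0 for e in examples) / len(examples)

best_set, best_score = select_few_shot_set(pool, toy_score)
print(best_set, best_score)
```

Selecting on the same data used for the final evaluation would inflate the reported score, which is why the list above freezes the prompt before the held-out evaluation.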

Limitations include increased labor and expertise demands (estimated 2–3× naïve engineering cost), dependency on human assessors, and potential codebook overfitting to training data distributions (Shah, 2024). Regular re-validation is recommended as data and LLM versions change.

7. Objectivity, Replicability, and Recommendations

Integrating codebook-guided procedures systematically embeds human consensus, reduces undocumented bias, and supports transparent documentation of all deliberations and prompt revisions. Other researchers can reuse domain-specific codebooks, annotated prompt templates, and deliberation logs, which strengthens objectivity and replicability (Shah, 2024). Key recommendations include:

Initiate with diverse pilot datasets for rapid codebook iteration.
Explicitly define reliability and effectiveness thresholds.
Share codebooks, prompt templates, and deliberation logs publicly.
Periodically re-validate as domain distributions and model APIs evolve.

The plausible implication is that CEPS offers a robust paradigm for theory-driven, empirically-validated LLM prompt engineering in scientific classification tasks requiring strong human–machine alignment.

References

Anglin et al. (2025). Improving Alignment Between Human and Machine Codes: An Empirical Assessment of Prompt Engineering for Construct Identification in Psychology.

Shah (2024). From Prompt Engineering to Prompt Science With Human in the Loop.
