DSPy+HELM Benchmarking Framework
- The DSPy+HELM framework is a unified benchmarking pipeline that combines modular prompt optimization with holistic evaluation to estimate language model performance ceilings.
- It replaces fixed prompts with modular DSPy programs, adding explicit chain-of-thought traces and Bayesian prompt optimization for robust, reproducible assessments.
- The framework reduces performance variability and bias by applying systematic prompt search and structured prompting across diverse benchmarks and domains.
The DSPy+HELM framework is a unified benchmarking pipeline for LMs that integrates structured and automated prompt optimization into the Holistic Evaluation of Language Models (HELM) suite. By extending HELM to support declarative prompting and systematic prompt search via DSPy, the framework enables robust estimation of each model's performance ceiling, i.e., the maximum achievable score across prompt variants, yielding empirically grounded, less variable, and more decision-informative benchmarks across domains.
1. System Architecture and Programmatic Structure
HELM's baseline design evaluates each model–benchmark pair with a single, fixed prompt $p_{\text{base}}$. DSPy+HELM generalizes this by replacing the one-shot prompt with a modular DSPy program $\Phi$ comprising $k$ modules $m_1, \dots, m_k$. Each module $m_i$ possesses a prompt template parameterized by open slots (variables) $V_i$, which include instruction text and in-context examples. The aggregate variable set is $V = \bigcup_{i=1}^{k} V_i$.
An assignment $a$ instantiates each variable $v \in V$ with a string $a(v)$. For a given input $x$, execution of the program under assignment $a$ is notated $\Phi_a(x)$. The HELM driver enumerates assignment collections per prompting method, runs $\Phi_a$ on each test example, and computes standard HELM metrics on the resulting outputs (Aali et al., 25 Nov 2025).
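For concreteness, the following minimal Python sketch shows what such a program $\Phi$ could look like in DSPy. The signature, field names, demo, and model identifier are illustrative assumptions rather than the paper's actual benchmark programs.

```python
import dspy

# Illustrative signature for a QA benchmark; field names are assumptions.
class AnswerQuestion(dspy.Signature):
    """Answer the question concisely."""  # instruction slot: one of the open variables in V
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

# A one-module program Φ. Its instruction text and attached demos are the
# open slots that the prompting methods described below assign.
program = dspy.Predict(AnswerQuestion)

# An "assignment" fixes concrete strings for those slots,
# e.g. by attaching a few-shot demo (toy example):
program.demos = [
    dspy.Example(question="2 + 2 = ?", answer="4").with_inputs("question"),
]

# Execution Φ_a(x): run the instantiated program on an input.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model name
prediction = program(question="What is the capital of France?")
print(prediction.answer)
```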
2. Prompt Optimization Methodologies
The framework evaluates the handcrafted HELM baseline alongside four DSPy-based prompting protocols:

| Method | Structure | Distinctive Elements |
|---|---|---|
| HELM Baseline | Single fixed prompt, no CoT | Handcrafted prompt |
| Zero-Shot Predict | DSPy `Predict` module, no demos, baseline instructions | Template modularization without added structure |
| Zero-Shot CoT | DSPy `ChainOfThought` module, REASONING + OUTPUT fields | Explicit reasoning traces elicited before the answer |
| BFRS | Few-shot, random demo search, batch validation | Bootstrapped pools of high-scoring examples |
| MIPROv2 | Joint Bayesian optimization over instructions and demos | Tree-structured Parzen Estimator (TPE) search |
HELM Baseline: Uses the handcrafted prompt $p_{\text{base}}$ with no CoT augmentation.
Zero-Shot Predict: Adopts the same instruction as $p_{\text{base}}$ and exposes the model to the modular DSPy template without additional exemplars.
Zero-Shot CoT: Applies a two-field template (REASONING, OUTPUT), instructing the model to emit an explicit chain-of-thought trace before its final answer.
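As a minimal illustration of the two-field behavior, assuming DSPy's standard `ChainOfThought` wrapper (the signature, model name, and example question are placeholders):

```python
import dspy

class AnswerMCQ(dspy.Signature):
    """Answer the multiple-choice question."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField(desc="single option letter")

# ChainOfThought prepends a reasoning field to the signature's outputs,
# so the model emits its chain of thought before the final answer.
cot_program = dspy.ChainOfThought(AnswerMCQ)

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model
pred = cot_program(question="Which is larger? A) 2^10  B) 10^2")
print(pred.reasoning)  # REASONING-style trace (named `rationale` in older DSPy releases)
print(pred.answer)     # OUTPUT field
```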
BFRS (Bootstrap Few-Shot + Random Search): Constructs demo pools from seed-program outputs on the training set $\mathcal{D}_{\text{train}}$, retaining input–output pairs on which the evaluation metric passes. Random combinations of these demos then undergo a fixed number of search trials, with each configuration scored on a validation minibatch; the highest-scoring configuration is selected.
MIPROv2 (Bayesian Optimization): Expands upon BFRS by additionally proposing LM-generated instruction candidates for each module. A Tree-structured Parzen Estimator (TPE) surrogate selects instruction–demo combinations that maximize the density ratio of high-scoring to low-scoring prior trials, with periodic evaluations on the full validation split determining the best overall assignment.
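Both optimizer-based protocols correspond to DSPy's built-in teleprompters. The sketch below is a hedged illustration of how they might be invoked: the seed program, metric, split-loading helper (`load_benchmark_splits`), and optimizer budgets are assumptions, not the released pipeline's exact configuration.

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch, MIPROv2

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # placeholder model

# Seed program Φ_B (string-signature shorthand); the real programs are
# seeded from HELM's baseline prompts.
program = dspy.ChainOfThought("question -> answer")

def exact_match(example, prediction, trace=None):
    # Placeholder metric; the pipeline uses each benchmark's HELM metric.
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Hypothetical helper: the actual splits come from the HELM benchmark loaders.
trainset, valset = load_benchmark_splits("MedCalc-Bench")

# BFRS: bootstrap demos from seed-program outputs that pass the metric, then
# random-search over demo combinations scored on validation minibatches.
bfrs = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,
    num_candidate_programs=16,
)
program_bfrs = bfrs.compile(program, trainset=trainset, valset=valset)

# MIPROv2: LM-proposed instruction candidates per module, searched jointly
# with demo sets by a TPE-based Bayesian optimizer.
mipro = MIPROv2(metric=exact_match, auto="light")
program_mipro = mipro.compile(program, trainset=trainset, valset=valset)
```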
3. Ceiling Performance Estimation
Performance ceiling is formalized as the supremal metric achieved over all prompt assignments:

$$\mathrm{Ceiling}(M, B) \;=\; \sup_{a}\;\frac{1}{|\mathcal{D}_{\mathrm{test}}|}\sum_{(x,\,y)\,\in\,\mathcal{D}_{\mathrm{test}}} \mu\big(\Phi_a(x),\, y\big),$$

where $\mathcal{D}_{\mathrm{test}}$ denotes the test set and $\mu$ the evaluation metric per benchmark.
Operationally, the ceiling is estimated by running the set of structured prompting methods (Zero-Shot CoT, BFRS, and MIPROv2) and retaining the maximal attained score per model/benchmark pair:

$$\widehat{\mathrm{Ceiling}}(M, B) \;=\; \max\big\{\, s_{\mathrm{ZS\text{-}CoT}},\; s_{\mathrm{BFRS}},\; s_{\mathrm{MIPROv2}} \,\big\}.$$

Ceiling estimation enables more accurate, less prompt-design-biased assessment of model capabilities (Aali et al., 25 Nov 2025).
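In code, the operational ceiling reduces to a maximum over per-method scores. A minimal sketch with illustrative numbers:

```python
# Per-(model, benchmark) scores from the structured prompting methods; values are illustrative.
scores = {
    "ZeroShotCoT": 0.71,
    "BFRS": 0.74,
    "MIPROv2": 0.76,
}

def estimate_ceiling(method_scores: dict[str, float]) -> tuple[str, float]:
    """Return the best-performing method and its score, i.e., the estimated ceiling."""
    best_method = max(method_scores, key=method_scores.get)
    return best_method, method_scores[best_method]

print(estimate_ceiling(scores))  # ('MIPROv2', 0.76)
```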
4. Benchmark Scope and Quantitative Outcomes
The framework is applied to seven HELM benchmarks spanning general and medical domains:
- MMLU-Pro (multi-task reasoning, 1,000 samples)
- GPQA (graduate-level QA, 446)
- GSM8K (grade-school math, 1,000)
- MedCalc-Bench (clinical calculation, 1,000)
- Medec (error-classification, 597)
- HeadQA (USMLE-style MCQ, 1,000)
- MedBullets (USMLE-style MCQ, 308)
Metrics employed include exact-match and within-range correctness as specified by HELM. For each (model, benchmark, prompt), per-benchmark accuracy and the standard deviation across benchmarks are reported. Performance improvements and variability reductions are summarized by macro-averaging across all settings and LMs.
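A small sketch of how the macro-average and across-benchmark standard deviation could be computed from per-benchmark accuracies; the model names and numbers below are placeholders, not reported results.

```python
import statistics

# accuracy[model][benchmark] for one prompting method; values are illustrative only.
accuracy = {
    "model_a": {"MMLU-Pro": 0.70, "GPQA": 0.42, "GSM8K": 0.88},
    "model_b": {"MMLU-Pro": 0.68, "GPQA": 0.45, "GSM8K": 0.91},
}

for model, per_bench in accuracy.items():
    vals = list(per_bench.values())
    macro_avg = statistics.mean(vals)   # macro-average across benchmarks
    sigma = statistics.stdev(vals)      # across-benchmark standard deviation
    print(f"{model}: macro-avg={macro_avg:.3f}, sigma={sigma:.3f}")
```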
Empirical findings:
- The HELM baseline underestimates model performance by several percentage points (pp) on average.
- Across-benchmark σ shrinks by several pp with structured prompting.
- Model leaderboards experience rank inversions on 3/7 benchmarks at ceiling (e.g., Claude 3.7 Sonnet surpasses o3 Mini on MMLU-Pro).
- The gap between the top two models contracts to 3 pp at the ceiling, down from a wider baseline gap.
5. Reasoning Elicitation and Robustness Properties
Structured prompting, especially the inclusion of explicit chain-of-thought (CoT) modules, directs the LM to emit a REASONING field prior to OUTPUT, inducing a latent variable $r$ representing the reasoning trace. Formally, the prediction under assignment $a$ factorizes as

$$p_a(y \mid x) \;=\; \sum_{r} p_a(y \mid r, x)\, p_a(r \mid x).$$
Theoretical analysis (via the data-processing and Pinsker inequalities) shows that for any two prompt assignments $a, a'$, the distance between predictive distributions is upper-bounded by the divergence between the induced reasoning-trace distributions:

$$\mathrm{TV}\big(p_a(y \mid x),\, p_{a'}(y \mid x)\big) \;\le\; \mathrm{TV}\big(p_a(r \mid x),\, p_{a'}(r \mid x)\big) \;\le\; \sqrt{\tfrac{1}{2}\,\mathrm{KL}\big(p_a(r \mid x)\,\Vert\, p_{a'}(r \mid x)\big)}.$$

Thus, if this divergence is less than half the decision margin $\gamma$, defined as the difference between the highest and next-highest predicted answer probability, the predicted answer is stable to prompt variation. CoT enrichment increases this margin by "averaging" over multiple valid reasoning pathways, thereby improving robustness to prompt formulation. Empirically, CoT configurations exhibit substantially reduced accuracy fluctuation (lower σ) across prompt assignments relative to zero-CoT baselines.
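The following numerical sketch, with made-up reasoning-trace and answer distributions for two prompt assignments, illustrates the stability check: the total-variation and Pinsker bounds on the reasoning-trace divergence are compared against half the decision margin γ.

```python
import math

def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def kl_divergence(p: dict[str, float], q: dict[str, float]) -> float:
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)

# Illustrative reasoning-trace distributions under two prompt assignments a, a'.
r_a  = {"path1": 0.60, "path2": 0.30, "path3": 0.10}
r_a2 = {"path1": 0.55, "path2": 0.35, "path3": 0.10}

# Illustrative answer distribution under assignment a; gamma is the gap between
# the top-1 and top-2 answer probabilities.
y_a = {"A": 0.62, "B": 0.28, "C": 0.10}
top1, top2 = sorted(y_a.values(), reverse=True)[:2]
gamma = top1 - top2

# Data processing: TV over answers <= TV over reasoning traces; Pinsker bounds TV by sqrt(KL/2).
tv_bound = total_variation(r_a, r_a2)
pinsker_bound = math.sqrt(0.5 * kl_divergence(r_a, r_a2))

stable = min(tv_bound, pinsker_bound) < gamma / 2
print(f"TV(reasoning)={tv_bound:.3f}, Pinsker bound={pinsker_bound:.3f}, "
      f"gamma/2={gamma / 2:.3f}, stable={stable}")
```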
6. Implementation, Open-Source Components, and Workflow
Key components include:
- DSPy+HELM integration (HELM PR 3893): incorporates DSPy program execution (`dspy.run(Φ, config)`) into HELM's Python codebase and benchmarking loop.
- Prompt optimization pipeline (github.com/StanfordMIMI/dspy-helm): automates (a) DSPy program construction seeded from baseline prompts, (b) application of each prompting strategy, (c) model/benchmark/method evaluation, and (d) collection of results in standard HELM reporting format.
A high-level workflow is as follows:
```
for each model M in {Claude 3.7, Gemini 2.0, GPT-4o, o3 Mini}:
    for each benchmark B in {MMLU-Pro, …, MedBullets}:
        build DSPy program Φ_B seeded with HELM's baseline prompt
        evaluate HELM Baseline: p_base → M → score
        for each method in {ZeroShotPredict, ZeroShotCoT, BFRS, MIPROv2}:
            Φ_config ← configure(Φ_B, method)
            if method in {BFRS, MIPROv2}:
                bootstrap demos from B.train
                optimize on B.val
            run Φ_config on B.test via M → score
        record all scores; compute ceiling = max_{methods} score
aggregate results; compute means, σ's, ranking flips, token costs
```
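For a single (model, benchmark, method) cell of this loop, evaluation can be expressed with DSPy's `Evaluate` utility. The sketch below assumes an exact-match metric, a placeholder model identifier, and a toy test split rather than the pipeline's actual HELM loaders.

```python
import dspy
from dspy.evaluate import Evaluate

dspy.configure(lm=dspy.LM("openai/gpt-4o"))  # placeholder model identifier

class AnswerMCQ(dspy.Signature):
    """Answer the multiple-choice question with a single option letter."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(AnswerMCQ)  # e.g., the Zero-Shot CoT configuration

def exact_match(example, prediction, trace=None):
    # HELM-style exact-match correctness on the final answer string.
    return example.answer.strip().lower() == prediction.answer.strip().lower()

# Toy test split; the real pipeline draws B.test from the HELM benchmark loaders.
testset = [dspy.Example(question="2 + 2 = ?  A) 3  B) 4", answer="B").with_inputs("question")]

evaluator = Evaluate(devset=testset, metric=exact_match, num_threads=4, display_progress=True)
score = evaluator(program)  # per-cell score that feeds the ceiling computation
print(score)
```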