DSPy+HELM Benchmarking Framework

Updated 1 December 2025
  • The DSPy+HELM framework is a unified benchmarking pipeline that combines modular prompt optimization with holistic evaluation to estimate language model performance ceilings.
  • It replaces fixed prompts with multi-module DSPy programs, adding explicit chain-of-thought traces and Bayesian prompt optimization for robust, reproducible assessments.
  • The framework reduces performance variability and bias by applying systematic prompt search and structured prompting across diverse benchmarks and domains.

The DSPy+HELM framework is a unified benchmarking pipeline for LMs that integrates structured and automated prompt optimization into the Holistic Evaluation of Language Models (HELM) suite. By extending HELM to support declarative prompting and systematic prompt search via DSPy, the framework enables robust estimation of each model’s performance ceiling (the maximum achievable score across prompt variants), yielding empirically grounded, less variable, and more decision-informative benchmarks across domains.

1. System Architecture and Programmatic Structure

HELM’s baseline design evaluates each model–benchmark pair with a single, fixed prompt $p_{\mathrm{base}}$. DSPy+HELM generalizes this by replacing the one-shot prompt with a modular DSPy program $\Phi$ comprising $m$ modules. Each module $i$ possesses a prompt template $p_i$ parameterized by open slots (variables) $V_i$, which include instruction text and up to $K_i$ in-context examples. The aggregate variable set is $V = \bigcup_i V_i$.

Assignments, denoted $V \to S$, instantiate each variable $v \in V$ with a string $s \in S$. For given inputs $x$, execution is notated as
$$\Phi_{V \to S}(x) = \mathrm{LM}\bigl( p_1[S],\, p_2[S],\, \dots,\, p_m[S],\, x \bigr).$$
The HELM driver enumerates assignment collections per prompting method, runs $\Phi_{V \to S}$ on each test example, and computes standard HELM metrics on the resultant outputs (Aali et al., 25 Nov 2025).
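
To make the programmatic structure concrete, the sketch below builds a single-module DSPy program playing the role of $\Phi$; the signature, class, and model names are illustrative assumptions, not the actual DSPy+HELM programs, which are seeded from each benchmark’s baseline HELM prompt.

```python
import dspy

# Minimal sketch of a modular DSPy program standing in for HELM's single fixed
# prompt. All names below (BenchmarkSignature, BenchmarkProgram, the model id)
# are illustrative assumptions, not taken from the DSPy+HELM codebase.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

class BenchmarkSignature(dspy.Signature):
    """Answer the benchmark question."""  # instruction slot: one open variable in V_i
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

class BenchmarkProgram(dspy.Module):
    def __init__(self):
        super().__init__()
        # Each sub-module carries its own template p_i whose open slots
        # (instruction text plus up to K_i in-context demos) an optimizer can fill.
        self.answer_question = dspy.Predict(BenchmarkSignature)

    def forward(self, question):
        return self.answer_question(question=question)

program = BenchmarkProgram()                      # plays the role of Φ
prediction = program(question="What is 12 * 7?")  # Φ_{V→S}(x) for one input x
print(prediction.answer)
```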

2. Prompt Optimization Methodologies

The framework compares HELM’s baseline prompt against four core DSPy prompting protocols:

| Method | Structure | Distinctive Elements |
|---|---|---|
| HELM Baseline | Single prompt, no CoT | Handcrafted $p_{\mathrm{base}}$ |
| Zero-Shot Predict | DSPy Predict, $K = 0$ demos, baseline instructions | Template modularization without added structure |
| Zero-Shot CoT | DSPy ChainOfThought module, REASONING + OUTPUT fields | Explicit reasoning traces elicited before the answer |
| BFRS | Few-shot, random demo search, batch validation | Bootstrapped pools $\mathcal{B}_i$ of high-score examples |
| MIPROv2 | Joint Bayesian optimization over instructions and demos | Tree-structured Parzen Estimator (TPE) based search |

HELM Baseline: Uses $p_{\mathrm{base}}$ with no CoT augmentation.

Zero-Shot Predict: Adopts the same instruction as $p_{\mathrm{base}}$ and exposes the model to the modular DSPy template without additional exemplars.

Zero-Shot CoT: Applies a two-field template,
$$\begin{aligned} &\text{INPUTS: } x \\ &\text{REASONING: } \_ \\ &\text{OUTPUT: } \_ \end{aligned}$$
instructing the model to output an explicit chain-of-thought trace before the final answer.
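
As an illustration, assuming DSPy’s stock ChainOfThought module (whose exact field names may differ from the paper’s template), the two-field behavior looks as follows:

```python
import dspy

# Sketch of the Zero-Shot CoT configuration using DSPy's stock ChainOfThought
# module, which generates a reasoning field before the answer field.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # illustrative backend choice

zero_shot_cot = dspy.ChainOfThought("question -> answer")
prediction = zero_shot_cot(
    question="A train travels 60 km in 45 minutes. What is its average speed in km/h?"
)
print(prediction)  # the Prediction object carries both the reasoning trace and the answer
```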

BFRS (Bootstrap Few-Shot + Random Search): Constructs demo pools $\mathcal{B}_i$ from seed-program outputs on the training set $D_{\mathrm{tr}}$, selecting input–output pairs with score $\geq \tau$. Random demo combinations undergo $R$ trials, with each configuration’s minibatch score $\widehat{J}_B(\mathbf{v})$ computed. The highest-scoring configuration $\mathbf{v}^*$ is selected.
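
A minimal sketch of this step, under the assumption that DSPy’s stock BootstrapFewShotWithRandomSearch teleprompter corresponds to what the paper calls BFRS; the program, metric, and data splits below are toy placeholders:

```python
import dspy
from dspy.teleprompt import BootstrapFewShotWithRandomSearch

dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))     # illustrative backend
program = dspy.ChainOfThought("question -> answer")  # seed program to optimize

# Toy stand-ins for D_tr / D_val; the real pipeline builds these from HELM splits.
trainset = [dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
            dspy.Example(question="What is 5 * 6?", answer="30").with_inputs("question")]
valset   = [dspy.Example(question="What is 9 - 4?", answer="5").with_inputs("question")]

# Metric mu; a bootstrapped trace enters the demo pool B_i only if it scores >= tau
# (exact match here, i.e. tau = 1).
def exact_match(example, prediction, trace=None):
    return example.answer.strip() == prediction.answer.strip()

bfrs = BootstrapFewShotWithRandomSearch(
    metric=exact_match,
    max_bootstrapped_demos=4,   # cap on demos per module (K_i)
    num_candidate_programs=8,   # number of random demo-combination trials R
)
optimized_program = bfrs.compile(program, trainset=trainset, valset=valset)
```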

MIPROv2 (Bayesian Optimization): Expands upon BFRS by proposing $T_i$ LM-generated instruction candidates $\mathcal{I}_i$ per module. Tree-structured Parzen Estimator (TPE) surrogates select candidates maximizing $\ell(\mathbf{v})/g(\mathbf{v})$ (the ratio of good-to-poor candidate densities), with periodic evaluations on the full validation split determining the best overall assignment.
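
A corresponding sketch with DSPy’s MIPROv2 teleprompter, continuing the toy program, metric, and splits from the BFRS sketch above (the auto="light" budget is an illustrative choice):

```python
from dspy.teleprompt import MIPROv2

# Sketch of the MIPROv2 step, reusing `program`, `exact_match`, `trainset`, and
# `valset` from the BFRS sketch above; auto="light" sets an illustrative search budget.
mipro = MIPROv2(metric=exact_match, auto="light")
optimized_program = mipro.compile(
    program,
    trainset=trainset,
    valset=valset,  # periodic full-validation evaluations pick the best assignment
)
```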

3. Ceiling Performance Estimation

The performance ceiling is formalized as the maximum metric value attainable over all prompt assignments, achieved by
$$\Phi^* = \arg\max_{V \to S} \frac{1}{|D|} \sum_{(x, y) \in D} \mu\bigl( \Phi_{V \to S}(x),\, y \bigr),$$
where $D$ denotes the test set and $\mu$ the evaluation metric for each benchmark.

Operationally, the ceiling is estimated by running the set of structured prompting methods (Zero-Shot CoT, BFRS, and MIPROv2) and retaining the maximal attained score per model/benchmark pair:
$$\max_{\text{method} \in \{ \text{ZSC},\, \text{BFRS},\, \text{MIPRO} \}} \text{score}_{\text{method}}.$$
Ceiling estimation enables a more accurate, less prompt-design-biased assessment of model capabilities (Aali et al., 25 Nov 2025).
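
In code, the ceiling rule reduces to a per-(model, benchmark) maximum over method scores; the values below are made-up placeholders for illustration only:

```python
# Toy illustration of the ceiling rule: keep the best score across the structured
# prompting methods for each (model, benchmark) pair. All numbers are placeholders.
scores = {
    ("model-A", "GSM8K"): {"zero_shot_cot": 0.91, "bfrs": 0.93, "miprov2": 0.94},
    ("model-A", "GPQA"):  {"zero_shot_cot": 0.48, "bfrs": 0.47, "miprov2": 0.50},
}
ceiling = {pair: max(per_method.values()) for pair, per_method in scores.items()}
print(ceiling)  # {('model-A', 'GSM8K'): 0.94, ('model-A', 'GPQA'): 0.5}
```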

4. Benchmark Scope and Quantitative Outcomes

The framework is applied to seven HELM benchmarks spanning general and medical domains:

  • MMLU-Pro (multi-task reasoning, 1,000 samples)
  • GPQA (graduate-level QA, 446)
  • GSM8K (grade-school math, 1,000)
  • MedCalc-Bench (clinical calculation, 1,000)
  • Medec (error-classification, 597)
  • HeadQA (USMLE-style MCQ, 1,000)
  • MedBullets (USMLE-style MCQ, 308)

Metrics employed include exact-match and within-range correctness as specified by HELM. For each (model, benchmark, prompt) setting, per-benchmark accuracy and the standard deviation $\sigma$ across benchmarks are reported. Performance improvements and variability reductions are summarized by macro-averaging across all settings and LMs.
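
As a small illustration of this reporting step, the snippet below computes a macro-average and the across-benchmark $\sigma$ for one hypothetical model/method pair; the accuracy values are placeholders, not results from the paper:

```python
import statistics

# Placeholder per-benchmark accuracies for one (model, method) pair.
per_benchmark_acc = {
    "MMLU-Pro": 0.78, "GPQA": 0.49, "GSM8K": 0.93, "MedCalc-Bench": 0.41,
    "Medec": 0.62, "HeadQA": 0.84, "MedBullets": 0.71,
}
macro_avg = statistics.mean(per_benchmark_acc.values())
sigma = statistics.stdev(per_benchmark_acc.values())  # across-benchmark spread
print(f"macro-average = {macro_avg:.3f}, sigma = {sigma:.3f}")
```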

Empirical findings:

  • The HELM baseline underestimates model performance by 4 percentage points (pp) on average.
  • Across-benchmark $\sigma$ shrinks by $\sim 2$ pp with structured prompting.
  • Model leaderboards experience rank inversions on 3/7 benchmarks at ceiling (e.g., Claude 3.7 Sonnet surpasses o3 Mini on MMLU-Pro).
  • The top-two model gap contracts from $\sim 6$ pp (baseline) to 3 pp (ceiling).

5. Reasoning Elicitation and Robustness Properties

Structured prompting, especially the inclusion of explicit chain-of-thought (CoT) modules, directs the LM to emit a REASONING field prior to OUTPUT, inducing a latent variable $\tau$ representing the reasoning trace. Formally, the factorization is
$$P_\theta(\tau, y \mid x, p) = P_\theta(\tau \mid x, p)\, P_\theta(y \mid \tau, x, p)$$
and
$$P_\theta(y \mid x, p) = \sum_{\tau} P_\theta(\tau \mid x, p)\, P_\theta(y \mid \tau, x, p).$$

Theoretical analysis (via the data-processing and Pinsker inequalities) demonstrates that, for any two prompts $p, p'$, the distance between predictive distributions is upper-bounded by the divergence between the corresponding reasoning-trace distributions. Thus, if this divergence is less than half the decision margin $m(x; p)$, defined as the difference between the highest and next-highest predicted answer probabilities,
$$m(x; p) = P(y^* \mid x, p) - \max_{y \neq y^*} P(y \mid x, p),$$
the prediction is stable to prompt variations. CoT enrichment increases $m(x; p)$ by “averaging” over multiple valid reasoning pathways, thereby improving robustness to prompt formulation. Empirically, CoT configurations exhibit substantially reduced accuracy fluctuation ($\Delta$) across prompt assignments relative to zero-CoT baselines.
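
A sketch of this standard argument, under the assumption that the answer-given-trace distribution is (approximately) the same under both prompts, combines the marginalization above with the data-processing and Pinsker inequalities:

```latex
% Sketch of the stability bound, assuming P_theta(y | tau, x, p) is (approximately)
% equal to P_theta(y | tau, x, p') for all traces tau.
\begin{aligned}
\bigl| P_\theta(y \mid x, p) - P_\theta(y \mid x, p') \bigr|
  &\le \mathrm{TV}\!\bigl( P_\theta(\tau \mid x, p),\, P_\theta(\tau \mid x, p') \bigr) \\
  &\le \sqrt{ \tfrac{1}{2}\, \mathrm{KL}\!\bigl( P_\theta(\tau \mid x, p) \,\big\|\, P_\theta(\tau \mid x, p') \bigr) } .
\end{aligned}
% If the right-hand side is below m(x; p)/2 for every answer y, the argmax answer
% under p' coincides with y^*, i.e., the prediction is stable to the prompt change.
```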

6. Implementation, Open-Source Components, and Workflow

Key components include:

  • DSPy+HELM integration (HELM PR 3893): Incorporates DSPy program execution (dspy.run(Φ, config)) into HELM’s Python code and benchmarking loop.
  • Prompt Optimization Pipeline (github.com/StanfordMIMI/dspy-helm): Automates (a) DSPy program construction seeded from baseline prompts, (b) application of each prompting strategy, (c) model/benchmark/method evaluation, and (d) collection of results in standard HELM reporting format.

A high-level workflow is as follows:

```
for each model M in {Claude 3.7, Gemini 2.0, GPT-4o, o3 Mini}:
  for each benchmark B in {MMLU-Pro, …, MedBullets}:
    build DSPy program Φ_B seeded with HELM’s baseline prompt
    evaluate HELM Baseline: p_base → M → score
    for each method in {ZeroShotPredict, ZeroShotCoT, BFRS, MIPROv2}:
      Φ_config ← configure(Φ_B, method)
      if method in {BFRS, MIPROv2}:
        bootstrap demos from B.train
        optimize on B.val
      run Φ_config on B.test via M → score
    record all scores; compute ceiling = max over methods of score
aggregate results; compute means, σ’s, ranking flips, token costs
```

Every pipeline stage—from prompt construction, through automated search, to metrics—operates via DSPy’s declarative API, ensuring full reproducibility. The integration of modular structured prompting and optimization within HELM thereby provides a scalable, systematic estimation of model performance ceilings, supporting more precise and actionable benchmarking outcomes (Aali et al., 25 Nov 2025).
