LangProBe Benchmark for Modular LMs
- LangProBe is a large-scale benchmark that evaluates modular language model systems by combining diverse tasks, program architectures, prompt optimizers, and language models.
- It uses a Cartesian framework over four axes to generate over 2000 unique configurations, enabling precise comparisons of cost and quality tradeoffs.
- Empirical findings show that optimized language programs significantly boost performance and reduce costs, emphasizing the need for systematic evaluation in LM systems.
LangProBe is a large-scale benchmark designed for the rigorous evaluation of modular language model (LM) systems composed as multi-step "language programs," together with prompt optimizers. It systematically examines the quality–cost tradeoffs arising from choices of language program architecture, prompt optimization, and underlying LMs across a diverse suite of tasks, constituting the first comprehensive measurement-driven study in this space. LangProBe encompasses over 2000 unique combinations of tasks, program architectures, optimization strategies, and models, and is accompanied by plans for open-sourcing code and evaluation data (Tan et al., 27 Feb 2025).
1. Benchmark Definition and Structure
LangProBe formalizes the evaluation space as a Cartesian product over four discrete axes:
- Task Set (T): Fifteen datasets, each with a specific input–output format and evaluation metric, spanning six categories: Code, Reasoning, Agent, Knowledge-Intensive QA, Classification, and Math.
- Program Architectures (P): Distinct modular language programs, each defining a directed acyclic graph (DAG) over core DSPy modules such as Predict, Retriever, and Critic. Examples include Direct Call (DSPy Predict), Chain-of-Thought (CoT), RAG, GenCriticRanker, GenCriticFuser, Simplified Baleen, ReAct variants, and voting or ranking ensembles specialized for extreme classification.
- Prompt Optimizers (O): Four strategies—BootstrapFewShot, BootstrapFewShotRandomSearch, MIPROv2 (Bayesian search over instructions + demonstrations), and RuleInfer (rule induction from successful exemplars).
- LMs (M): Six contemporary models: OpenAI gpt-4o, gpt-4o-mini, o1-mini, Meta Llama3.1-8B-Instruct, Llama3.2-3B-Instruct, and Llama3.3-70B-Instruct.
Each configuration is a 4-tuple (t, p, o, m) drawn from T × P × O × M, representing a dataset, program, optimizer, and model instance, respectively.
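The configuration space can be sketched as a plain Cartesian product. This is a minimal illustration with toy subsets of each axis (the values shown are a small sample, not the benchmark's full lists):

```python
from itertools import product

# Toy subsets of the four axes; the real benchmark enumerates 15 tasks,
# 10+ programs, 4 optimizers, and 6 models, yielding over 2000 configurations.
tasks = ["HotpotQA", "GSM8K", "HumanEval"]
programs = ["Predict", "CoT", "RAG"]
optimizers = ["BootstrapFewShot", "MIPROv2"]
models = ["gpt-4o-mini", "Llama3.1-8B-Instruct"]

# Each configuration is one 4-tuple (task, program, optimizer, model).
configs = list(product(tasks, programs, optimizers, models))
print(len(configs))  # 3 * 3 * 2 * 2 = 36 for this toy subset
```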
| Task Category | Datasets | Evaluation Metric |
|---|---|---|
| Code | HumanEval, SWEUnderspecified, SWEValidity | pass@1 |
| Reasoning | JudgeBench, Scone | accuracy/pairwise preference |
| Agent | AppWorld | task-specific |
| Knowledge-Intensive | MMLU, HoVer, IReRa, HotpotQA, HotpotQAConditional, RAGQAArena | accuracy, rank-precision, F1 |
| Classification | HeartDisease, Iris | accuracy |
| Math | MATH, GSM8K | integer exact-match |
2. Methodological Components
2.1 Program Architecture Formalism
A language program specifies a structured composition of LM calls using DSPy modules. Programs may instantiate single-step calls (Direct), sequential reasoning (CoT), multi-stage generation–critique–ranking (Archon-style), retrieval hybridization (RAG, Baleen), agentic interleaving (ReAct), or ensembling (CoTBasedVote). Formally, the architecture is a DAG, precisely defining the invocation structure and call count per input.
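The DAG view of a program can be made concrete with a minimal executor. This sketch is not DSPy's actual API; it only illustrates how a topologically ordered graph of modules determines the invocation structure and per-input call count (module bodies here are stub callables standing in for LM calls):

```python
from graphlib import TopologicalSorter

def run_program(modules, deps, inputs):
    """Execute a language-program DAG.
    modules: name -> callable(parent_values_dict) -> value
    deps: name -> set of parent node names (graph inputs have empty deps)."""
    values = dict(inputs)
    calls = 0
    for name in TopologicalSorter(deps).static_order():
        if name in values:          # graph input, no LM call needed
            continue
        values[name] = modules[name]({p: values[p] for p in deps[name]})
        calls += 1                  # one LM invocation per module execution
    return values, calls

# Toy RAG-style program: retrieve -> generate (stubs in place of real LM calls)
modules = {
    "retrieve": lambda v: f"docs about {v['question']}",
    "generate": lambda v: f"answer using {v['retrieve']}",
}
deps = {"question": set(), "retrieve": {"question"}, "generate": {"retrieve"}}
out, n_calls = run_program(modules, deps, {"question": "Who wrote Hamlet?"})
print(n_calls)  # 2 LM calls for this two-module DAG
```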
2.2 Prompt Optimization Strategies
Each optimizer operates over programs, with fixed hyperparameters across all runs:
- BootstrapFewShot: Greedily selects few-shot exemplars up to a fixed budget.
- BootstrapFewShotRandomSearch: Randomized search in the demonstration space.
- MIPROv2: Bayesian optimization over concatenated instruction and demonstration spaces, optionally with a "strong" internal model for proposal evaluation.
- RuleInfer: Induces natural-language rules from successful demonstrations and appends them to subsequent prompts (Algorithm 1 in the paper).
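The RuleInfer loop can be sketched at a high level. This is an illustrative reconstruction, not the paper's exact Algorithm 1; `run` and `induce_rules` are hypothetical stand-ins for the program and the LM-backed rule-induction call:

```python
def rule_infer(demos, metric, run, induce_rules):
    """demos: [{'input': ..., 'output': ...}]; run(x, rules) executes the
    program with the given rules appended to its prompt context."""
    # Keep only demonstrations the unoptimized program already gets right.
    successes = [d for d in demos if metric(run(d["input"], []), d["output"])]
    rules = induce_rules(successes)   # one LM call in the real optimizer
    return lambda x: run(x, rules)    # program with induced rules in context

# Toy stubs: a "program" that uppercases its input iff a rule says so.
run = lambda x, rules: x.upper() if "uppercase" in rules else x
induce = lambda succ: ["uppercase"] if succ else []
demos = [{"input": "ok", "output": "ok"}]
metric = lambda pred, gold: pred == gold
optimized = rule_infer(demos, metric, run, induce)
print(optimized("hi"))  # "HI" once the rule has been induced
```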
2.3 Metrics and Cost–Quality Frontiers
Each candidate configuration is evaluated on:
- Cost (C): Monetary inference cost per input, aggregated over all LM calls, computed as C = Σ over LM calls of (p_in · n_in + p_out · n_out), with model-dependent input/output token prices (p_in, p_out) and token counts (n_in, n_out).
- Quality (Q): Task-native metric, normalized to [0, 1] (e.g., pass@1, exact-match, accuracy, rank-precision, F1).
- Pareto Frontier: The set of non-dominated (C, Q) pairs; a configuration is dominated if another achieves lower or equal cost and higher or equal quality, with at least one inequality strict.
Aggregate plots visualize the convex hull on log-cost vs. quality axes.
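The non-dominated set defined above is straightforward to compute. A minimal sketch (the (cost, quality) values below are invented for illustration):

```python
def pareto_frontier(points):
    """points: list of (cost, quality) pairs. A point is dropped if some
    other point has cost <= and quality >= with at least one strict."""
    frontier = []
    for c, q in points:
        dominated = any(
            c2 <= c and q2 >= q and (c2 < c or q2 > q)
            for c2, q2 in points
        )
        if not dominated:
            frontier.append((c, q))
    return sorted(frontier)

# Toy configurations: (normalized cost, quality)
configs = [(1.0, 0.62), (0.5, 0.70), (0.82, 0.83), (2.0, 0.80)]
print(pareto_frontier(configs))  # [(0.5, 0.7), (0.82, 0.83)]
```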
3. Empirical Findings and Illustrative Examples
The primary empirical finding is that optimized language programs (combining program architecture and prompt optimizer) systematically offer strong Pareto improvements over direct raw LM calls.
- Aggregate Analysis: The configuration "Model + Program + Optimizer" strictly dominates "Model alone," "Model + Program," and "Model + Optimizer" in cost–quality space. For instance, gpt-4o-mini with optimal program and optimizer achieves +11.68 points aggregate quality compared to gpt-4o raw at half the monetary cost; it also matches or surpasses gpt-4o + program at approximately 10% of the cost.
- HotpotQA Case Study: On the HotpotQA multi-document QA task, the baseline gpt-4o raw achieves 62% quality at the reference (1×) normalized cost. The combination "Simplified Baleen + MIPROv2 on gpt-4o-mini" yields 83% quality (a 33.9% relative improvement) at 0.82× cost.
- GSM8K Case Study: For GSM8K, gpt-4o-mini plus BootstrapFewShotRandomSearch attains 75% quality vs. 69% for unoptimized gpt-4o-mini, also reducing cost by approximately 5%.
- HeartDisease Example: Notably, on the binary classification task HeartDisease, the combination Llama3.2-3B-Instruct with GenCriticFuser (K=5) and MIPROv2 optimizer (internal proposer gpt-4o-mini, 50 trials) achieves 76.3% accuracy, an increase from the 26.3% unoptimized baseline. The program executes as follows:
- Generator: Produces 5 candidate diagnosis chains.
- Critic: Annotates strengths/weaknesses of each.
- Fuser: Aggregates into the final decision and rationale.
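The three-stage pipeline above can be sketched with stub LM calls. This is an illustrative skeleton, not the benchmark's DSPy implementation; the stub generator, critic, and fuser stand in for Llama3.2-3B-Instruct calls:

```python
def gen_critic_fuser(question, generate, critic, fuse, k=5):
    """Generate K candidates, critique each, fuse into one decision."""
    candidates = [generate(question, i) for i in range(k)]  # K candidate chains
    critiques = [critic(c) for c in candidates]             # strengths/weaknesses
    return fuse(candidates, critiques)                      # final decision

# Toy stubs: alternating candidates, majority vote as the "fuser".
gen = lambda q, i: "disease" if i % 2 == 0 else "no disease"
crit = lambda c: f"candidate '{c}' is plausible"
fuse = lambda cands, crits: max(set(cands), key=cands.count)
print(gen_critic_fuser("patient record ...", gen, crit, fuse))  # "disease"
```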
4. Comparative Effectiveness and Tradeoffs
Comparing the optimizers by how often they reach near-optimal quality and by their overall impact:
- MIPROv2 (and its "MIPROv2-T" variant) achieves the highest prevalence of near-optimal results (within 3% of best).
- BootstrapFewShotRandomSearch serves as a robust, simple baseline.
- RuleInfer excels for rule-driven tasks (e.g., classification, some code challenges) but demonstrates overfitting on open-ended QA.
Across the full benchmark, at the 90th percentile, all optimizers deliver sizable relative quality gains (80–122%) over unoptimized baselines; median improvements span 6.3–18.1%. Occasional degradation (negative relative gains) occurs when induced rules or few-shot selection overfit the training demonstrations.
No single program architecture or optimizer is universally dominant, which underscores the necessity of empirically guided configuration choice. This result is evident in both aggregate and per-task analyses.
5. Limitations
Current limitations of LangProBe include:
- Coverage is restricted to 15 datasets and roughly ten program architectures; agentic benchmarks are limited (no full SWE-bench), and program-of-thought variants are absent.
- Supported LMs are limited to six (newer models such as o3-mini and DeepSeek-R1 are absent).
- Cost metric is strictly monetary (USD, token-based); latency, energy, privacy, and infrastructure-specific costs are not evaluated.
This suggests that real-world deployments may require additional evaluation criteria beyond those used in LangProBe.
6. Future Research Directions
Proposed future advancements and open challenges include:
- Expansion to broader taxonomies of program architectures, such as tree-of-thought and self-ask, and richer task types (e.g., large-scale code generation, multimodal reasoning).
- Exploration of new optimization paradigms, including reinforcement learning (RL)-based prompt tuning and module-level fine-tuning of language programs.
- Development of automated configuration agents (e.g., "AutoLangPro") to recommend optimal program architectures and optimizers for novel tasks.
- Incorporation of comprehensive cost models explicitly capturing latency, carbon emissions, and on-device versus remote inference costs.
The open-sourcing of LangProBe artifacts is intended to accelerate innovation and reproducibility in modular, cost-efficient LM system design.
7. Summary and Significance
LangProBe establishes that structuring LMs as modular language programs, jointly optimized for prompt content, enables substantial improvements in both output quality and computational cost, often allowing smaller or less expensive LMs to outperform larger counterparts. The empirical landscape demonstrates that neither program architecture nor optimizer is universally optimal, reinforcing the critical role of systematic benchmarking and empirical evaluation. LangProBe constitutes a foundational resource for researchers seeking to design, optimize, and compare modular LM-centric systems on rigorous, principled grounds (Tan et al., 27 Feb 2025).