LangProBe Benchmark for Modular LMs
- LangProBe is a large-scale benchmark that evaluates modular language model systems by combining diverse tasks, program architectures, prompt optimizers, and language models.
- It uses a Cartesian framework over four axes to generate over 2000 unique configurations, enabling precise comparisons of cost and quality tradeoffs.
- Empirical findings show that optimized language programs significantly boost performance and reduce costs, emphasizing the need for systematic evaluation in LM systems.
LangProBe is a large-scale benchmark designed for the rigorous evaluation of modular language model (LM) systems composed as multi-step "language programs," together with prompt optimizers. It systematically examines the quality–cost tradeoffs arising from choices of language program architecture, prompt optimization, and underlying LMs across a diverse suite of tasks, constituting the first comprehensive measurement-driven study in this space. LangProBe encompasses over 2000 unique combinations of tasks, program architectures, optimization strategies, and models, and is accompanied by plans for open-sourcing code and evaluation data (Tan et al., 27 Feb 2025).
1. Benchmark Definition and Structure
LangProBe formalizes the evaluation space as a Cartesian product over four discrete axes:
- Task Set (T): Fifteen datasets, each with a specific input–output format and evaluation metric, spanning six categories: Code, Reasoning, Agent, Knowledge-Intensive QA, Classification, and Math.
- Program Architectures (P): Distinct modular language programs, each defining a directed acyclic graph (DAG) over core DSPy modules such as Predict, Retriever, and Critic. Examples include Direct Call (DSPy Predict), Chain-of-Thought (CoT), RAG, GenCriticRanker, GenCriticFuser, Simplified Baleen, ReAct variants, and voting or ranking ensembles specialized for extreme classification.
- Prompt Optimizers (O): Four strategies—BootstrapFewShot, BootstrapFewShotRandomSearch, MIPROv2 (Bayesian search over instructions + demonstrations), and RuleInfer (rule induction from successful exemplars).
- LMs (M): Six contemporary models: OpenAI gpt-4o, gpt-4o-mini, o1-mini, Meta Llama3.1-8B-Instruct, Llama3.2-3B-Instruct, and Llama3.3-70B-Instruct.
Each configuration is a 4-tuple (t, p, o, m) drawn from T × P × O × M, representing a dataset, program, optimizer, and model instance, respectively.
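The configuration space can be sketched as a plain Cartesian product. This is a minimal illustration with toy subsets of each axis (the values shown are a small sample, not the benchmark's full lists):

```python
from itertools import product

# Toy subsets of the four axes; the real benchmark enumerates 15 tasks,
# 10+ programs, 4 optimizers, and 6 models, yielding over 2000 configurations.
tasks = ["HotpotQA", "GSM8K", "HumanEval"]
programs = ["Predict", "CoT", "RAG"]
optimizers = ["BootstrapFewShot", "MIPROv2"]
models = ["gpt-4o-mini", "Llama3.1-8B-Instruct"]

# Each configuration is one 4-tuple (task, program, optimizer, model).
configs = list(product(tasks, programs, optimizers, models))
print(len(configs))  # 3 * 3 * 2 * 2 = 36 for this toy subset
```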
| Task Category | Datasets | Evaluation Metric |
|---|---|---|
| Code | HumanEval, SWEUnderspecified, SWEValidity | pass@1 |
| Reasoning | JudgeBench, Scone | accuracy/pairwise preference |
| Agent | AppWorld | task-specific |
| Knowledge-Intensive | MMLU, HoVer, IReRa, HotpotQA, HotpotQAConditional, RAGQAArena | accuracy, rank-precision, F1 |
| Classification | HeartDisease, Iris | accuracy |
| Math | MATH, GSM8K | integer exact-match |
2. Methodological Components
2.1 Program Architecture Formalism
A language program specifies a structured composition of LM calls using DSPy modules. Programs may instantiate single-step calls (Direct), sequential reasoning (CoT), multi-stage generation–critique–ranking (Archon-style), retrieval hybridization (RAG, Baleen), agentic interleaving (ReAct), or ensembling (CoTBasedVote). Formally, the architecture is a DAG, precisely defining the invocation structure and call count per input.
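The DAG view of a program can be made concrete with a minimal executor. This sketch is not DSPy's actual API; it only illustrates how a topologically ordered graph of modules determines the invocation structure and per-input call count (module bodies here are stub callables standing in for LM calls):

```python
from graphlib import TopologicalSorter

def run_program(modules, deps, inputs):
    """Execute a language-program DAG.
    modules: name -> callable(parent_values_dict) -> value
    deps: name -> set of parent node names (graph inputs have empty deps)."""
    values = dict(inputs)
    calls = 0
    for name in TopologicalSorter(deps).static_order():
        if name in values:          # graph input, no LM call needed
            continue
        values[name] = modules[name]({p: values[p] for p in deps[name]})
        calls += 1                  # one LM invocation per module execution
    return values, calls

# Toy RAG-style program: retrieve -> generate (stubs in place of real LM calls)
modules = {
    "retrieve": lambda v: f"docs about {v['question']}",
    "generate": lambda v: f"answer using {v['retrieve']}",
}
deps = {"question": set(), "retrieve": {"question"}, "generate": {"retrieve"}}
out, n_calls = run_program(modules, deps, {"question": "Who wrote Hamlet?"})
print(n_calls)  # 2 LM calls for this two-module DAG
```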
2.2 Prompt Optimization Strategies
Each optimizer operates over programs, with fixed hyperparameters across all runs:
- BootstrapFewShot: Greedily selects few-shot exemplars up to a fixed budget.
- BootstrapFewShotRandomSearch: Randomized search in the demonstration space.
- MIPROv2: Bayesian optimization over concatenated instruction and demonstration spaces, optionally with a "strong" internal model for proposal evaluation.
- RuleInfer: Induces natural-language rules from successful demonstrations and appends them to subsequent prompts (Algorithm 1 in the paper).
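The RuleInfer loop can be sketched at a high level. This is an illustrative reconstruction, not the paper's exact Algorithm 1; `run` and `induce_rules` are hypothetical stand-ins for the program and the LM-backed rule-induction call:

```python
def rule_infer(demos, metric, run, induce_rules):
    """demos: [{'input': ..., 'output': ...}]; run(x, rules) executes the
    program with the given rules appended to its prompt context."""
    # Keep only demonstrations the unoptimized program already gets right.
    successes = [d for d in demos if metric(run(d["input"], []), d["output"])]
    rules = induce_rules(successes)   # one LM call in the real optimizer
    return lambda x: run(x, rules)    # program with induced rules in context

# Toy stubs: a "program" that uppercases its input iff a rule says so.
run = lambda x, rules: x.upper() if "uppercase" in rules else x
induce = lambda succ: ["uppercase"] if succ else []
demos = [{"input": "ok", "output": "ok"}]
metric = lambda pred, gold: pred == gold
optimized = rule_infer(demos, metric, run, induce)
print(optimized("hi"))  # "HI" once the rule has been induced
```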
2.3 Metrics and Cost–Quality Frontiers
Each candidate configuration is evaluated on:
- Cost (C): Monetary inference cost per input, aggregated over all LM calls, computed as C = Σ over LM calls of (p_in · n_in + p_out · n_out), with model-dependent input/output token prices (p_in, p_out) and token counts (n_in, n_out).
- Quality (Q): Task-native metric, normalized to [0, 1] (e.g., pass@1, exact-match, accuracy, rank-precision, F1).
- Pareto Frontier: The set of non-dominated (C, Q) pairs; a configuration is dominated if another achieves lower or equal cost and higher or equal quality, with at least one inequality strict.
Aggregate plots visualize the convex hull on log-cost vs. quality axes.
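The non-dominated set defined above is straightforward to compute. A minimal sketch (the (cost, quality) values below are invented for illustration):

```python
def pareto_frontier(points):
    """points: list of (cost, quality) pairs. A point is dropped if some
    other point has cost <= and quality >= with at least one strict."""
    frontier = []
    for c, q in points:
        dominated = any(
            c2 <= c and q2 >= q and (c2 < c or q2 > q)
            for c2, q2 in points
        )
        if not dominated:
            frontier.append((c, q))
    return sorted(frontier)

# Toy configurations: (normalized cost, quality)
configs = [(1.0, 0.62), (0.5, 0.70), (0.82, 0.83), (2.0, 0.80)]
print(pareto_frontier(configs))  # [(0.5, 0.7), (0.82, 0.83)]
```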
3. Empirical Findings and Illustrative Examples
The primary empirical finding is that optimized language programs (combining program architecture and prompt optimizer) systematically offer strong Pareto improvements over direct raw LM calls.
- Aggregate Analysis: The configuration "Model + Program + Optimizer" strictly dominates "Model alone," "Model + Program," and "Model + Optimizer" in cost–quality space. For instance, gpt-4o-mini with optimal program and optimizer achieves +11.68 points aggregate quality compared to gpt-4o raw at half the monetary cost; it also matches or surpasses gpt-4o + program at approximately 10% of the cost.
- HotpotQA Case Study: On the HotpotQA multi-document QA task, the baseline gpt-4o raw achieves 62% quality at the reference (1×) normalized cost. The combination "Simplified Baleen + MIPROv2 on gpt-4o-mini" yields 83% quality (a 33.9% relative improvement) at 0.82× cost.
- GSM8K Case Study: For GSM8K, gpt-4o-mini plus BootstrapFewShotRandomSearch attains 75% quality vs. 69% for unoptimized gpt-4o-mini, also reducing cost by approximately 5%.
- HeartDisease Example: Notably, on the binary classification task HeartDisease, the combination Llama3.2-3B-Instruct with GenCriticFuser (K=5) and MIPROv2 optimizer (internal proposer gpt-4o-mini, 50 trials) achieves 76.3% accuracy, an increase from the 26.3% unoptimized baseline. The program executes as follows:
- Generator: Produces 5 candidate diagnosis chains.
- Critic: Annotates strengths/weaknesses of each.
- Fuser: Aggregates into the final decision and rationale.
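The three-stage pipeline above can be sketched with stub LM calls. This is an illustrative skeleton, not the benchmark's DSPy implementation; the stub generator, critic, and fuser stand in for Llama3.2-3B-Instruct calls:

```python
def gen_critic_fuser(question, generate, critic, fuse, k=5):
    """Generate K candidates, critique each, fuse into one decision."""
    candidates = [generate(question, i) for i in range(k)]  # K candidate chains
    critiques = [critic(c) for c in candidates]             # strengths/weaknesses
    return fuse(candidates, critiques)                      # final decision

# Toy stubs: alternating candidates, majority vote as the "fuser".
gen = lambda q, i: "disease" if i % 2 == 0 else "no disease"
crit = lambda c: f"candidate '{c}' is plausible"
fuse = lambda cands, crits: max(set(cands), key=cands.count)
print(gen_critic_fuser("patient record ...", gen, crit, fuse))  # "disease"
```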
4. Comparative Effectiveness and Tradeoffs
Comparing the optimizers by how often they reach near-optimal quality and by their overall impact:
- MIPROv2 (and its "MIPROv2-T" variant) achieves the highest prevalence of near-optimal results (within 3% of best).
- BootstrapFewShotRandomSearch serves as a robust, simple baseline.
- RuleInfer excels for rule-driven tasks (e.g., classification, some code challenges) but demonstrates overfitting on open-ended QA.
Across the full benchmark, at the 90th percentile, all optimizers deliver sizable relative quality gains (80–122%) over unoptimized baselines; median improvements span 6.3–18.1%. Occasional degradation (negative relative gains) occurs when induced rules or few-shot selection overfit the training demonstrations.
No single program architecture or optimizer is universally dominant, which underscores the necessity of empirically guided configuration choice. This result is evident in both aggregate and per-task analyses.
5. Limitations
Current limitations of LangProBe include:
- Coverage is restricted to 15 datasets and roughly ten program architectures; agentic benchmarks are limited (no full SWE-bench), and program-of-thought variants are absent.
- Supported LMs are limited to six (newer models such as o3-mini and DeepSeek-R1 are absent).
- Cost metric is strictly monetary (USD, token-based); latency, energy, privacy, and infrastructure-specific costs are not evaluated.
This suggests that real-world deployments may require additional evaluation criteria beyond those used in LangProBe.
6. Future Research Directions
Proposed future advancements and open challenges include:
- Expansion to broader taxonomies of program architectures, such as tree-of-thought and self-ask, and richer task types (e.g., large-scale code generation, multimodal reasoning).
- Exploration of new optimization paradigms, including reinforcement learning (RL)-based prompt tuning and module-level fine-tuning of language programs.
- Development of automated configuration agents (e.g., "AutoLangPro") to recommend optimal program architectures and optimizers for novel tasks.
- Incorporation of comprehensive cost models explicitly capturing latency, carbon emissions, and on-device versus remote inference costs.
The open-sourcing of LangProBe artifacts is intended to accelerate innovation and reproducibility in modular, cost-efficient LM system design.
7. Summary and Significance
LangProBe establishes that structuring LMs as modular language programs, jointly optimized for prompt content, enables substantial improvements in both output quality and computational cost, often allowing smaller or less expensive LMs to outperform larger counterparts. The empirical landscape demonstrates that neither program architecture nor optimizer is universally optimal, reinforcing the critical role of systematic benchmarking and empirical evaluation. LangProBe constitutes a foundational resource for researchers seeking to design, optimize, and compare modular LM-centric systems on rigorous, principled grounds (Tan et al., 27 Feb 2025).