EleutherAI LM Evaluation Harness
- EleutherAI Language Model Evaluation Harness is an open-source toolkit designed to ensure reproducible, transparent, and comparable evaluations of autoregressive language models.
- It integrates task modules, standardized LM backends, and a unified orchestration system to compute metrics like perplexity, accuracy, and bits-per-byte over diverse datasets.
- The framework addresses methodological challenges by centralizing prompt templates, detailed logging, and uncertainty reporting, thereby enabling verifiable, apples-to-apples model comparisons.
The EleutherAI Language Model Evaluation Harness (lm-eval) is an open-source library developed to address methodological issues in evaluating autoregressive LMs, including insufficient reproducibility, transparency, and comparability. By providing a centralized, extensible framework for task and dataset management, prompt templating, metric calculation, and experiment orchestration, lm-eval underpins modern standards of rigorous and transparent LM benchmarking across diverse tasks and datasets, minimizing methodological drift and enabling verifiable, apples-to-apples comparisons (Biderman et al., 2024).
1. Motivation and Foundational Principles
lm-eval was conceived to systematically address critical challenges in LM evaluation: reproducibility, transparency, and comparability. Reproducibility is achieved by unifying benchmark implementations and evaluation logic in a single codebase, eliminating inconsistencies arising from bespoke local re-implementations. Transparency is maintained through open YAML and Python descriptions of every prompt template, answer extraction heuristic, and metric implementation, each with explicit versioning. Comparability is enforced by running all baselines and new models through an identical code path, formalizing changes in prompt style, model interface, or metric as controlled, explicit diffs and preventing untracked methodological deviations (Biderman et al., 2024).
2. System Architecture and Data Flow
The lm-eval framework consists of three primary components:
- Task Modules: Responsible for dataset loading (via HuggingFace Datasets), prompt template definition (Jinja-style), mapping of each data example to one or more primitive “Requests” (`loglikelihood`, `loglikelihood_rolling`, `generate_until`, `multiple_choice`), and post-processing of raw LM outputs into per-example scores and aggregated metrics.
- LM Backends: Any autoregressive LM can be interfaced by wrapping its model and tokenizer to implement a standardized Request API:
  - `loglikelihood(requests: List[ConditionalLoglikelihoodRequest]) -> List[float]`
  - `loglikelihood_rolling(requests: List[PerplexityRequest]) -> List[float]`
  - `generate_until(requests: List[GenerationRequest]) -> List[str]`
  - `multiple_choice(requests: List[MultipleChoiceRequest]) -> List[int]` (optional shortcut)
- Orchestrator / CLI / Python API: Reads structured experiment configuration (YAML or JSON), instantiates task/model objects, iterates through dataset splits in batched form, dispatches Requests, processes results, and writes detailed outputs (metrics, standard errors, per-sample logs, task versions, seeds, code commit hash).
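The Request API above can be illustrated with a minimal, self-contained sketch. `UniformCharLM` is a toy stand-in (a uniform distribution over a pretend 128-symbol byte vocabulary), not the library's actual backend class; only the method signature mirrors the interface described above.

```python
from dataclasses import dataclass
from typing import List
import math

@dataclass
class ConditionalLoglikelihoodRequest:
    context: str       # conditioning prefix
    continuation: str  # string to score given the context

class UniformCharLM:
    """Toy backend: a uniform distribution over a fixed character
    vocabulary, standing in for a real model/tokenizer pair."""
    VOCAB_SIZE = 128  # pretend byte-level vocabulary (an assumption)

    def loglikelihood(
        self, requests: List[ConditionalLoglikelihoodRequest]
    ) -> List[float]:
        # Each character of the continuation contributes log(1/V)
        # under the uniform model, regardless of context.
        out = []
        for req in requests:
            n_tokens = len(req.continuation)
            out.append(n_tokens * math.log(1.0 / self.VOCAB_SIZE))
        return out

backend = UniformCharLM()
scores = backend.loglikelihood(
    [ConditionalLoglikelihoodRequest("Question: 2+2?\nAnswer:", " 4")]
)
print(scores[0])  # 2 characters * log(1/128)
```

A real backend would tokenize context and continuation, run a forward pass, and sum the continuation tokens' log-probabilities; the batching and dispatch logic stays the same.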
The experiment data and prediction flow may be summarized as:
```python
for task_cfg in experiment_config.tasks:
    task = Task.from_config(task_cfg)
    for split in [validation, test]:
        examples = task.load_split(split)
        batches = chunk(examples, batch_size)
        for batch in batches:
            requests = task.build_requests(batch)
            raw_outs = model.dispatch(requests)
            scores = task.process_raw(raw_outs, batch)
            aggregate(scores)
```
3. Task Coverage and Dataset Implementations
Out of the box, lm-eval provides over 50 task implementations, spanning a broad spectrum of evaluation regimes:
- Language Modeling / Perplexity: Benchmarks such as Wikitext-103, The Pile, C4, and OpenWebText2; metrics include bits-per-byte, token-level and byte-level perplexity, and both sliding-window and non-overlapping chunk evaluation strategies.
- Multiple-Choice QA: ARC (Easy & Challenge), MMLU (57 subjects), OpenBookQA, PIQA, HellaSwag, TruthfulQA, AGIEval, GPQA, supporting pre-tokenized choices and length- or byte-normalized scoring.
- Classification / NLI / NLU: SuperGLUE (BoolQ, CB, RTE, WiC, WSC, COPA, MultiRC, ReCoRD), WinoGrande, SocialIQA.
- Reading Comprehension: TriviaQA, NaturalQuestions, DROP.
- Mathematical Reasoning: MATH, GSM8K (with solution generation and unit test verification).
- Toxicity & Bias: RealToxicityPrompts.
Each dataset is paired with a canonical “doc_to_text” template (e.g., `Question: {...}\nAnswer:`), mappings from document to choice or target, preferred metric lists, and explicit version metadata to track prompt and scoring changes (Biderman et al., 2024).
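The length- and byte-normalized scoring mentioned for multiple-choice tasks can be sketched as follows. The selection rule (pick the choice with the highest raw or normalized loglikelihood) follows the description above; the function name and the example numbers are illustrative.

```python
def pick_choice(loglikelihoods, token_lengths, byte_lengths, norm=None):
    """Select a multiple-choice answer from per-choice loglikelihoods.

    norm=None    -> raw summed loglikelihood (plain accuracy)
    norm='token' -> divide by token count (length-normalized)
    norm='byte'  -> divide by byte count (byte-normalized, acc_norm-style)
    """
    if norm == "token":
        scores = [ll / n for ll, n in zip(loglikelihoods, token_lengths)]
    elif norm == "byte":
        scores = [ll / b for ll, b in zip(loglikelihoods, byte_lengths)]
    else:
        scores = list(loglikelihoods)
    return max(range(len(scores)), key=scores.__getitem__)

# A longer choice can win after normalization even though its raw
# summed loglikelihood is lower.
lls = [-4.0, -6.0]   # raw loglikelihoods of the two choices
toks = [2, 6]        # token counts
bytes_ = [10, 30]    # byte counts
print(pick_choice(lls, toks, bytes_))               # 0 (raw)
print(pick_choice(lls, toks, bytes_, norm="token")) # 1 (-2.0 vs -1.0)
```

This is why benchmarks often report both `acc` and a normalized variant: the two rules can disagree whenever answer choices differ in length.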
4. Configuration Schema and Experiment Specification
Experiments are configured via YAML or JSON with explicit fields for:
- Grouping (e.g., for output tables)
- Task name (matched to implemented Tasks)
- Dataset path, name, and splits
- Output type (multiple choice, loglikelihood, generation, rolling loglikelihood)
- Prompt and target templates
- Few-shot configurations (split, sampler, number of shots)
- Metrics (list of metric, aggregation, and “higher is better” flags)
- Metadata including task version
Example configuration for ARC-Easy in “cloze” style:
```yaml
group:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
```
5. Evaluation Pipeline and Interfaces
lm-eval supports both a command-line interface and a Python API for experiment execution. Typical usage specifies the target model, tasks, batch size, seed, and output directory or JSON dump location:
- Command-line Example:
```shell
pip install lm-eval
lm-eval \
  --model hf \
  --model_args pretrained=facebook/opt-2.7b \
  --tasks arc_easy,mmlu \
  --batch_size 8 \
  --seed 42 \
  --output_path results/opt-2.7b
```
- Python API Example:
```python
import json
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-2.7b",
    tasks=["arc_easy", "mmlu"],
    batch_size=8,
)
with open("results/opt-2.7b.json", "w") as f:
    json.dump(results["results"], f)
print(results["results"])
```
Output artifacts include per-task metrics, standard errors, task versions, random seed, and code commit hash. Per-sample logs are available for inspection, and all evaluation logic supports early exit/debugging via sample count limits (Biderman et al., 2024).
6. Reproducibility Practices and Extensibility
Reproducibility is enforced through several mechanisms:
- Random-seed control at global and per-few-shot-sampler levels
- Task versioning and explicit reporting of metadata.version in outputs
- Environment capture (code commit hash logged; environment files recommended)
- Configurable per-sample logging (raw prompts and model outputs)
- Statistical uncertainty reporting (bootstrapped or analytic standard error for every metric)
- Debugging features (sample number limits)
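The bootstrapped standard error mentioned above can be sketched with a short resampling routine. This is an illustrative implementation of the general bootstrap idea, not the harness's exact code; the sample scores are made up.

```python
import random
import statistics

def bootstrap_stderr(per_sample_scores, n_resamples=1000, seed=42):
    """Estimate the standard error of the mean metric by resampling
    per-example scores with replacement and taking the spread of
    the resampled means."""
    rng = random.Random(seed)  # seeded for reproducibility
    n = len(per_sample_scores)
    means = []
    for _ in range(n_resamples):
        resample = [per_sample_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return statistics.stdev(means)

# 0/1 accuracy scores for a hypothetical 8-example task
scores = [1, 0, 1, 1, 0, 1, 1, 0]
se = bootstrap_stderr(scores)
print(se)  # close to the analytic sqrt(p*(1-p)/n) for 0/1 scores
```

For simple metrics like accuracy the analytic standard error is cheap, but the bootstrap generalizes to aggregations (e.g., F1, BLEU) where no closed form exists.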
lm-eval is designed for extensibility:
- Adding New Tasks: By subclassing `lm_eval.base.Task` or providing a compliant YAML config, with supplied methods for doc-to-text, doc-to-choice (for multiple choice), doc-to-target, few-shot sampling, and per-example result processing.
- Registering New LM Backends: By subclassing `lm_eval.base.LM` and implementing the relevant Request API, then registering under a unique name.
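The register-under-a-unique-name pattern can be sketched with a minimal decorator-based registry. The registry, decorator, and `EchoLM` class here are stand-ins for illustration, not lm-eval's actual internals.

```python
from typing import Callable, Dict, List

# Illustrative registry mapping backend names to classes.
MODEL_REGISTRY: Dict[str, type] = {}

def register_model(name: str) -> Callable[[type], type]:
    """Decorator that records a backend class under a unique name,
    so configs can refer to it by string."""
    def decorator(cls: type) -> type:
        MODEL_REGISTRY[name] = cls
        return cls
    return decorator

@register_model("echo-lm")
class EchoLM:
    """Trivial backend: 'generates' by echoing the prompt's last word."""
    def generate_until(self, requests: List[str]) -> List[str]:
        return [prompt.split()[-1] for prompt in requests]

# The orchestrator can now look backends up by name.
backend = MODEL_REGISTRY["echo-lm"]()
print(backend.generate_until(["The capital of France is Paris"]))
```

The string key is what lets experiment configs stay declarative: swapping models means changing one name, not any evaluation code.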
7. Core Metrics and Notation
The framework implements and exposes a set of standard evaluation metrics, including:
- Perplexity: $\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$, where $N$ is the number of tokens.
- Bits-per-Byte (BPB): $\mathrm{BPB} = -\frac{1}{B \ln 2}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})$, where $B$ is the number of bytes.
- Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbb{1}[\hat{y}_i = y_i]$, the fraction of examples predicted correctly.
- Exact Match: 1 if generated string is identical to reference, else 0.
- F1 (for span-based QA): $F_1 = \frac{2PR}{P + R}$, where precision $P$ and recall $R$ are computed over tokens shared between prediction and reference.
- BLEU: $\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left(\sum_{n=1}^{4} w_n \log p_n\right)$, combining modified $n$-gram precisions $p_n$ with a brevity penalty $\mathrm{BP}$.
- ROUGE-L: Longest common subsequence–based F-score.
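The perplexity and bits-per-byte definitions can be checked numerically with a few lines of Python; the per-token log-probabilities below are made up for illustration.

```python
import math

def perplexity(token_logprobs):
    """exp of the average negative log-likelihood per token
    (log-probabilities in nats)."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

def bits_per_byte(token_logprobs, n_bytes):
    """Total NLL in nats converted to bits, normalized by byte count."""
    return -sum(token_logprobs) / (n_bytes * math.log(2))

logprobs = [-2.0, -1.0, -3.0]  # hypothetical per-token log-probabilities
print(perplexity(logprobs))        # exp(2.0) ≈ 7.389
print(bits_per_byte(logprobs, 12)) # 6 / (12 ln 2) ≈ 0.721
```

Note that perplexity normalizes by token count while BPB normalizes by byte count, which is why BPB is preferred when comparing models with different tokenizers.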
8. Empirical Use Cases, Lessons, and Pitfalls
lm-eval has played a key role in landmark evaluation studies:
- Multiprompt Evaluation (BigScience/PromptSource): Highlighted significant variance across prompt templates, emphasizing the necessity of prompt robustness checks.
- Prompt Sensitivity Analyses: For tasks such as ARC and MMLU, varying the prompt style (cloze vs. MMLU-style) for models like GPT-NeoX-20B, Llama-2-7B, and Falcon-7B shifted reported accuracies substantially (e.g., from a 26–56% spread to a 38–43% spread, with roughly ±2% confidence intervals). lm-eval’s uniform task library is essential to avoid misleading, non-equivalent comparisons.
- Architecture Benchmarks: Supported benchmarking of novel architectures (e.g., Hyena, RWKV, RetNet, H3) by isolating architectural effects from evaluation artifacts.
- Open LLM Leaderboards: lm-eval facilitated integration of standardized tasks, reducing duplicated effort and promoting consistent model reporting.
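The cloze vs. MMLU-style contrast noted above is easy to make concrete: the same question yields two structurally different prompts, and models can score very differently on each. The formatting below is a plausible sketch of the two styles, not the harness's exact templates.

```python
def cloze_prompt(question, choice):
    """Cloze style: each answer choice is scored as a free-text
    continuation of the question."""
    return f"Question: {question}\nAnswer: {choice}"

def mmlu_style_prompt(question, choices):
    """MMLU style: lettered options are listed and the model is
    scored on producing the correct letter."""
    letters = "ABCD"
    opts = "\n".join(f"{l}. {c}" for l, c in zip(letters, choices))
    return f"{question}\n{opts}\nAnswer:"

q = "Which planet is known as the Red Planet?"
choices = ["Venus", "Mars", "Jupiter", "Saturn"]
print(cloze_prompt(q, choices[1]))
print(mmlu_style_prompt(q, choices))
```

Because the two styles elicit different request types (continuation scoring vs. letter prediction), only a shared, versioned template library makes cross-paper numbers comparable.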
Clear lessons have emerged:
- Always share code, prompt templates, and model outputs.
- Rerun baselines directly using lm-eval rather than copying results.
- Conduct qualitative sanity checks before large-scale runs.
- Report uncertainty (e.g., standard errors, CIs) and use multiple seeds or few-shot draws.
- Pin task and code versions, rigorously modularize benchmarks, and foster community contributions to a shared harness (Biderman et al., 2024).
9. Significance and Community Impact
In over three years of open-source deployment, lm-eval has become a cornerstone tool for the rigorous evaluation of LLMs, widely adopted for benchmarking, sensitivity analysis, and leaderboard construction. By enforcing methodological rigor—through versioned code, shared benchmarks, and explicit uncertainty reporting—lm-eval directly addresses long-standing pitfalls in LM evaluation, enabling transparent, reproducible, and extensible research practices throughout the field (Biderman et al., 2024).