
EleutherAI LM Evaluation Harness

Updated 19 February 2026
  • EleutherAI Language Model Evaluation Harness is an open-source toolkit designed to ensure reproducible, transparent, and comparable evaluations of autoregressive language models.
  • It integrates task modules, standardized LM backends, and a unified orchestration system to compute metrics like perplexity, accuracy, and bits-per-byte over diverse datasets.
  • The framework addresses methodological challenges by centralizing prompt templates, detailed logging, and uncertainty reporting, thereby enabling verifiable, apples-to-apples model comparisons.

The EleutherAI Language Model Evaluation Harness (lm-eval) is an open-source library developed to address methodological issues in evaluating autoregressive LMs, including insufficient reproducibility, transparency, and comparability. By providing a centralized, extensible framework for task and dataset management, prompt templating, metric calculation, and experiment orchestration, lm-eval underpins modern standards of rigorous and transparent LM benchmarking across diverse tasks and datasets, minimizing methodological drift and enabling verifiable, apples-to-apples comparisons (Biderman et al., 2024).

1. Motivation and Foundational Principles

lm-eval was conceived to systematically address critical challenges in LM evaluation: reproducibility, transparency, and comparability. Reproducibility is achieved by unifying benchmark implementations and evaluation logic in a single codebase, eliminating inconsistencies arising from bespoke local re-implementations. Transparency is maintained through open YAML and Python descriptions of every prompt template, answer extraction heuristic, and metric implementation, each with explicit versioning. Comparability is enforced by running all baselines and new models through an identical code path, formalizing changes in prompt style, model interface, or metric as controlled, explicit diffs and preventing untracked methodological deviations (Biderman et al., 2024).

2. System Architecture and Data Flow

The lm-eval framework consists of three primary components:

  1. Task Modules: Responsible for dataset loading (via HuggingFace Datasets), prompt template definition (Jinja-style), mapping of each data example to one or more primitive “Requests” (loglikelihood, loglikelihood_rolling, generate_until, multiple_choice), and post-processing of raw LM outputs into per-example scores and aggregated metrics.
  2. LM Backends: Any autoregressive LM can be interfaced by wrapping its model and tokenizer to implement a standardized Request API:
    • loglikelihood(requests: List[ConditionalLoglikelihoodRequest]) → List[float]
    • loglikelihood_rolling(requests: List[PerplexityRequest]) → List[float]
    • generate_until(requests: List[GenerationRequest]) → List[str]
    • multiple_choice(requests: List[MultipleChoiceRequest]) → List[int] (optional shortcut)
  3. Orchestrator / CLI / Python API: Reads structured experiment configuration (YAML or JSON), instantiates task/model objects, iterates through dataset splits in batched form, dispatches Requests, processes results, and writes detailed outputs (metrics, standard errors, per-sample logs, task versions, seeds, code commit hash).

The experiment data and prediction flow may be summarized as:

for task_cfg in experiment_config.tasks:
    task = Task.from_config(task_cfg)
    for split in ["validation", "test"]:
        examples = task.load_split(split)
        batches = chunk(examples, batch_size)
        for batch in batches:
            requests = task.build_requests(batch)
            raw_outs = model.dispatch(requests)
            scores = task.process_raw(raw_outs, batch)
            aggregate(scores)
(Biderman et al., 2024)

3. Task Coverage and Dataset Implementations

Out of the box, lm-eval provides over 50 task implementations, spanning a broad spectrum of evaluation regimes:

  • Language Modeling / Perplexity: Benchmarks such as Wikitext-103, The Pile, C4, and OpenWebText2; metrics include bits-per-byte, token-level and byte-level perplexity, and both sliding-window and non-overlapping chunk evaluation strategies.
  • Multiple-Choice QA: ARC (Easy & Challenge), MMLU (57 subjects), OpenBookQA, PIQA, HellaSwag, TruthfulQA, AGIEval, GPQA, supporting pre-tokenized choices and length- or byte-normalized scoring.
  • Classification / NLI / NLU: SuperGLUE (BoolQ, CB, RTE, WiC, WSC, COPA, MultiRC, ReCoRD), WinoGrande, SocialIQA.
  • Reading Comprehension: TriviaQA, NaturalQuestions, DROP.
  • Mathematical Reasoning: MATH, GSM8K (with solution generation and unit test verification).
  • Toxicity & Bias: RealToxicityPrompts.

Each dataset is paired with a canonical “doc_to_text” template (e.g., Question: {...}\nAnswer:), mappings from document to choice or target, preferred metric lists, and explicit version metadata to track prompt and scoring changes (Biderman et al., 2024).
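The mapping from document to prompt and target can be illustrated with a minimal sketch. The document fields below mirror the ARC-style schema discussed in this section, but the rendering uses Python's `str.format` as a stand-in for the harness's Jinja engine, so it is illustrative only:

```python
# Illustrative ARC-style document; field names mirror the dataset schema
# described above but are shown here only as an example.
doc = {
    "question": "Which gas do plants absorb from the atmosphere?",
    "choices": {"text": ["Oxygen", "Carbon dioxide"], "label": ["A", "B"]},
    "answerKey": "B",
}

# Render the doc_to_text prompt for this document (str.format stands in
# for the harness's Jinja templating).
prompt = "Question: {question}\nAnswer:".format(**doc)

# doc_to_target resolves the gold choice index from the answer key.
target_idx = doc["choices"]["label"].index(doc["answerKey"])
gold = doc["choices"]["text"][target_idx]
print(prompt)
print(gold)
```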

4. Configuration Schema and Experiment Specification

Experiments are configured via YAML or JSON with explicit fields for:

  • Grouping (e.g., for output tables)
  • Task name (matched to implemented Tasks)
  • Dataset path, name, and splits
  • Output type (multiple choice, loglikelihood, generation, rolling loglikelihood)
  • Prompt and target templates
  • Few-shot configurations (split, sampler, number of shots)
  • Metrics (list of metric, aggregation, and “higher is better” flags)
  • Metadata including task version

Example configuration for ARC-Easy in “cloze” style:

group:
  - ai2_arc
task: arc_easy
dataset_path: allenai/ai2_arc
dataset_name: ARC-Easy
output_type: multiple_choice
training_split: train
validation_split: validation
test_split: test
doc_to_text: "Question: {{question}}\nAnswer:"
doc_to_target: "{{choices.label.index(answerKey)}}"
doc_to_choice: "{{choices.text}}"
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 1.0
(Biderman et al., 2024)

5. Evaluation Pipeline and Interfaces

lm-eval supports both a command-line interface and a Python API for experiment execution. Typical usage specifies the target model, tasks, batch size, seed, and output directory or JSON dump location:

  • Command-line Example:

pip install lm-eval
lm-eval \
  --model hf \
  --model_args pretrained=facebook/opt-2.7b \
  --tasks arc_easy,mmlu \
  --batch_size 8 \
  --seed 42 \
  --output_path results/opt-2.7b

  • Python API Example:

import json

from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=facebook/opt-2.7b",
    tasks=["arc_easy", "mmlu"],
    batch_size=8,
)
with open("results/opt-2.7b.json", "w") as f:
    json.dump(results["results"], f)
print(results["results"])

Output artifacts include per-task metrics, standard errors, task versions, random seed, and code commit hash. Per-sample logs are available for inspection, and all evaluation logic supports early exit/debugging via sample count limits (Biderman et al., 2024).
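The shape of such an output record can be sketched as follows. The field names here are illustrative of the per-task artifacts listed above (metrics, standard errors, versions, seed, commit hash), not a guaranteed schema of lm-eval's actual dump:

```python
import json

# Hypothetical results record mirroring the output artifacts described
# above; the exact JSON schema written by lm-eval may differ.
record = {
    "results": {"arc_easy": {"acc": 0.62, "acc_stderr": 0.01}},
    "versions": {"arc_easy": 1.0},
    "config": {"seed": 42, "git_hash": "deadbeef"},
}

# Round-trip through JSON, as a downstream consumer of the dump would.
loaded = json.loads(json.dumps(record))
acc = loaded["results"]["arc_easy"]["acc"]
err = loaded["results"]["arc_easy"]["acc_stderr"]
print(f"arc_easy: acc = {acc:.2f} ± {err:.2f}")
```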

6. Reproducibility Practices and Extensibility

Reproducibility is enforced through several mechanisms:

  • Random-seed control at global and per-few-shot-sampler levels
  • Task versioning and explicit reporting of metadata.version in outputs
  • Environment capture (code commit hash logged; environment files recommended)
  • Configurable per-sample logging (raw prompts and model outputs)
  • Statistical uncertainty reporting (bootstrapped or analytic standard error for every metric)
  • Debugging features (sample number limits)
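The bootstrapped standard error mentioned above can be sketched generically as resampling the per-example scores with replacement; this is a self-contained illustration, not lm-eval's exact routine:

```python
import random

def bootstrap_stderr(scores, iters=1000, seed=42):
    """Estimate the standard error of the mean of per-example scores by
    bootstrap resampling (a generic sketch, not lm-eval's internal code)."""
    rng = random.Random(seed)
    n = len(scores)
    means = []
    for _ in range(iters):
        # Resample n scores with replacement and record the mean.
        resample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    mu = sum(means) / iters
    var = sum((m - mu) ** 2 for m in means) / (iters - 1)
    return var ** 0.5

# Example: per-example 0/1 accuracy scores.
scores = [1, 0] * 50
print(f"mean={sum(scores)/len(scores):.2f}  stderr≈{bootstrap_stderr(scores):.3f}")
```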

lm-eval is designed for extensibility:

  • Adding New Tasks: By subclassing lm_eval.base.Task or providing a compliant YAML config, with supplied methods for doc-to-text, doc-to-choice (for multiple choice), doc-to-target, few-shot sampling, and per-example result processing.
  • Registering New LM Backends: By subclassing lm_eval.base.LM and implementing the relevant Request API, then registering under a unique name.
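As an illustration of the backend contract, the sketch below wires a trivial model into the Request API described in Section 2. The class is purely hypothetical (a real backend would wrap an actual model and tokenizer and subclass lm_eval.base.LM), and requests are simplified to (context, continuation) string pairs:

```python
import math
from typing import List, Tuple

class UniformLM:
    """Toy backend mirroring the Request API from Section 2.  It assigns a
    uniform probability over a fixed vocabulary, so it is only a wiring
    sketch -- not a real scorer."""

    def __init__(self, vocab_size: int = 256):
        self.vocab_size = vocab_size

    def loglikelihood(self, requests: List[Tuple[str, str]]) -> List[float]:
        # One uniform log-prob per continuation character (a stand-in for
        # real tokens); returns one float per request, as the API requires.
        return [len(cont) * -math.log(self.vocab_size) for _, cont in requests]

    def generate_until(self, requests: List[Tuple[str, str]]) -> List[str]:
        # A real backend would sample until the stop sequence; here we
        # simply return empty completions.
        return ["" for _ in requests]

model = UniformLM()
print(model.loglikelihood([("Question: 2+2=", " 4")]))
```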

7. Core Metrics and Notation

The framework implements and exposes a set of standard evaluation metrics, including:

  • Perplexity:

\mathrm{Perp}(D;\theta) = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})\right)

where $N$ is the number of tokens.

  • Bits-per-byte:

\mathrm{BPB} = \frac{1}{\log 2}\left(-\frac{1}{B}\sum_{i=1}^{N} \log p_\theta(w_i \mid w_{<i})\right)

where $B$ is the number of bytes.

  • Accuracy:

\mathrm{Acc} = \frac{\#\,\mathrm{correct}}{\#\,\mathrm{total}}

  • Exact Match: 1 if generated string is identical to reference, else 0.
  • F1 (for span-based QA):

\mathrm{F}_1 = 2\,\frac{\mathrm{Precision}\times\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}

  • BLEU:

\mathrm{BLEU} = \mathrm{BP}\cdot\exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \quad \mathrm{BP} = \min\left(1,\, e^{1-r/c}\right)

  • ROUGE-L: Longest common subsequence–based F-score.

(Biderman et al., 2024)
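Under the definitions above, perplexity and bits-per-byte can be computed directly from per-token natural-log probabilities. The sketch below is a transcription of those formulas, not lm-eval's internal implementation:

```python
import math

def perplexity(logprobs):
    """exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(logprobs) / len(logprobs))

def bits_per_byte(logprobs, n_bytes):
    """Negative total natural-log probability, converted to bits and
    normalized by the byte length B of the text."""
    return -sum(logprobs) / (n_bytes * math.log(2))

# Four tokens, each assigned probability 1/2 -> perplexity of 2, and
# 4 bits of total surprisal spread over 8 bytes -> 0.5 bits per byte.
lp = [math.log(0.5)] * 4
print(perplexity(lp))        # ≈ 2.0
print(bits_per_byte(lp, 8))  # ≈ 0.5
```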

8. Empirical Use Cases, Lessons, and Pitfalls

lm-eval has played a key role in landmark evaluation studies:

  • Multiprompt Evaluation (BigScience/PromptSource): Highlighted significant variance across prompt templates, emphasizing the necessity of prompt robustness checks.
  • Prompt Sensitivity Analyses: For tasks such as ARC and MMLU, varying the prompt style (cloze vs. MMLU-style) for models like GPT-NeoX-20B, Llama-2-7B, and Falcon-7B produced accuracy ranges of 26–56% under one style versus 38–43% under the other (±2% CI). lm-eval's uniform task library is essential for avoiding misleading, non-equivalent comparisons.
  • Architecture Benchmarks: Supported benchmarking of novel architectures (e.g., Hyena, RWKV, RetNet, H3) by isolating architectural effects from evaluation artifacts.
  • Open LLM Leaderboards: lm-eval facilitated integration of standardized tasks, reducing duplicated effort and promoting consistent model reporting.

Clear lessons have emerged:

  • Always share code, prompt templates, and model outputs.
  • Rerun baselines directly using lm-eval rather than copying results.
  • Conduct qualitative sanity checks before large-scale runs.
  • Report uncertainty (e.g., standard errors, CIs) and use multiple seeds or few-shot draws.
  • Pin task and code versions, rigorously modularize benchmarks, and foster community contributions to a shared harness (Biderman et al., 2024).

9. Significance and Community Impact

In over three years of open-source deployment, lm-eval has become a cornerstone tool for the rigorous evaluation of LLMs, widely adopted for benchmarking, sensitivity analysis, and leaderboard construction. By enforcing methodological rigor—through versioned code, shared benchmarks, and explicit uncertainty reporting—lm-eval directly addresses long-standing pitfalls in LM evaluation, enabling transparent, reproducible, and extensible research practices throughout the field (Biderman et al., 2024).
