LLMEval Dataset: LLM Evaluation
- The LLMEval Dataset is a comprehensive evaluation resource that benchmarks LLM capabilities through rigorous multi-cohort annotation and two complementary task subsets.
- It features LLMEval-1 and LLMEval-2 to assess both general abilities in Chinese and domain-specific competence across academic fields.
- The dataset employs both manual star scoring and GPT-4-based automatic evaluations, providing insights for robust ranking and future LLM assessment frameworks.
The LLMEval Dataset refers to a family of evaluation resources and frameworks designed to systematically assess the capabilities, reliability, and limitations of LLMs on an array of general and specialized tasks. Rooted in rigorous annotation, multi-faceted evaluation criteria, and large-scale comparative analysis, the LLMEval dataset—first introduced in "LLMEval: A Preliminary Study on How to Evaluate LLMs" (Zhang et al., 2023)—and its subsequent extensions provide a methodological foundation for the evolving science of LLM evaluation.
1. Dataset Structure and Task Coverage
LLMEval is built on two primary subsets targeting comprehensive and domain-specific evaluation:
LLMEval-1 focuses on broad LLM capabilities and includes 453 carefully selected questions spanning 17 task types: factual QA, open-ended question answering, translation, text retrieval, code generation, role-playing, classification, outline generation, math solving, summarization, reading comprehension, poetry generation, reasoning, paragraph and story generation, conversation, and re-writing. Each question is answered by twelve models, producing a total of 5,436 responses and 29,898 response pairings for pairwise evaluation. The task set is entirely in Chinese.
LLMEval-2 targets specialized, disciplinary evaluation, drawing on questions across 12 academic fields (biological science, chemistry, computer science, economics, law, mathematics, medicine, etc.). Roughly 480 questions in total are constructed across these fields by domain experts (typically college students), and the annotation protocol distinguishes objective criteria (e.g., correctness, explanation quality) from subjective ones (accuracy, fluency, informativeness, logicality).
| Subset | Content Summary | Key Features |
|---|---|---|
| LLMEval-1 | 17 task types, 5,436 responses, 29,898 pairings | Chinese; diverse, general capabilities |
| LLMEval-2 | 12 subjects, ~480 questions, mix of question types | Disciplinary; objective/subjective split |
This design ensures that LLMEval comprehensively addresses both the breadth of LLM abilities and detailed subject-level competence.
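For orientation, the sketch below shows how a single LLMEval-1 pairwise comparison might be represented in code. The schema and field names are hypothetical and chosen purely for illustration; the actual release format should be taken from the public repository.

```python
# Hypothetical sketch only: these field names do not reflect the actual
# LLMEval release format, which should be checked against the repository.

from dataclasses import dataclass

@dataclass
class PairwiseRecord:
    """One pairwise comparison drawn from LLMEval-1 (hypothetical schema)."""
    question_id: str   # identifier of one of the 453 questions
    task_type: str     # one of the 17 task types, e.g. "translation"
    model_a: str       # first model in the pair
    model_b: str       # second model in the pair
    response_a: str    # model_a's answer (Chinese)
    response_b: str    # model_b's answer (Chinese)
    judgement: str     # "win", "draw", or "loss" from model_a's perspective

# Example record; contents are invented for illustration.
record = PairwiseRecord(
    question_id="q_0042",
    task_type="summarization",
    model_a="model_x",
    model_b="model_y",
    response_a="...",
    response_b="...",
    judgement="win",
)
```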
2. Annotation Procedures and Scoring Criteria
LLMEval’s evaluation methodology integrates both large-scale manual annotation and automatic (LLM-based) scoring:
Manual Evaluation Regimes
- Onsite annotators: Rate every response on a 1–3 star scale across five criteria: accuracy, fluency, informativeness, logical coherence, and harmlessness.
- Crowdsourced annotators: Perform pairwise response comparisons (win/draw/loss judgements).
- Public annotators: Contribute pairwise judgements via an online evaluation platform.
Automatic Evaluation
- GPT-4 is employed as an automatic rater, mirroring human annotation protocols in both star scoring and pairwise comparison templates.
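As a rough illustration of this setup, the sketch below shows how an LLM judge could be asked for 1–3 star ratings over the five criteria, assuming the openai Python client (v1+). The prompt wording is paraphrased for illustration and is not the paper's exact template.

```python
# Minimal sketch of LLM-as-judge star scoring, assuming the openai>=1.0 Python
# client. The prompt wording is paraphrased; LLMEval's actual templates are
# part of the public release.

from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

CRITERIA = ["accuracy", "fluency", "informativeness", "logical coherence", "harmlessness"]

def star_score(question: str, answer: str, model: str = "gpt-4") -> str:
    """Ask the judge model for 1-3 star ratings on each criterion."""
    prompt = (
        "You are an evaluation assistant. Rate the answer to the question below "
        f"on each of these criteria: {', '.join(CRITERIA)}. "
        "Give 1, 2, or 3 stars per criterion and return one line per criterion.\n\n"
        f"Question: {question}\n\nAnswer: {answer}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```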
Scoring and Ranking Mechanisms
Two main scoring systems are employed:
- Star Scoring: Scores are averaged across annotators for each response, providing a continuous measure on each of the criteria.
- Pairwise Comparison: Each model’s response is compared to one from another model for the same question; decisions are mapped to win/loss/draw, forming a competition-like structure.
For ranking, LLMEval implements:
- Points Scoring: Updated recursively for each comparison:

  $$P_i \leftarrow P_i + s, \qquad s = \begin{cases} 1 & \text{win} \\ 0.5 & \text{draw} \\ 0 & \text{loss} \end{cases}$$

  where $s$ is the outcome assigned to model $i$ for that comparison.
- Elo Rating System: Adapted from competitive games, combining an "expected win" probability with an update factor $K$:

  $$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}, \qquad R_A' = R_A + K\,(S_A - E_A),$$

  where $S_A$ is 1 (win), 0.5 (draw), or 0 (loss). The variance in Elo ratings is quantified by

  $$\operatorname{Var}[\Delta R_A] = K^2\, p\,(1 - p),$$

  where $p$ is the win probability.
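A minimal sketch of these two ranking updates is given below; the K-factor of 32 and the initial rating of 1000 are conventional defaults rather than values reported in the paper.

```python
# Sketch of the two ranking mechanisms described above. The K-factor and the
# initial rating of 1000 are conventional defaults, not values from the paper.

def points_update(points: float, outcome: float) -> float:
    """Add the comparison outcome s (1 win, 0.5 draw, 0 loss) to the running total."""
    return points + outcome

def elo_update(r_a: float, r_b: float, outcome: float, k: float = 32.0) -> tuple[float, float]:
    """One Elo update for model A vs. model B, given A's outcome S_A."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    delta = k * (outcome - expected_a)
    return r_a + delta, r_b - delta

# Example: model A (rated 1000) beats model B (rated 1020).
ra, rb = elo_update(1000.0, 1020.0, outcome=1.0)
```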
3. Experimental Scale and Annotation Cohorts
A total of 2,186 annotators contributed to the manual evaluation, yielding 243,337 manual and 57,511 GPT-4-based annotation results. The project distinguishes among three annotator cohorts (onsite, crowd, public) to facilitate systematic comparison of annotation quality, accuracy, and consistency.
Empirical analysis found:
- Onsite annotators show the highest consistency and the closest alignment with reference judgments.
- Public annotators show lower accuracy and higher variance.
- Crowdsourcing offers an intermediate trade-off.
Automated GPT-4 scoring achieved closer agreement with humans under the star rating setup compared to pairwise comparison.
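One way to quantify such cross-judge agreement on win/draw/loss labels is sketched below, using simple percent agreement and Cohen's kappa; the paper's exact consistency metric may differ.

```python
# Sketch of cross-judge agreement on win/draw/loss labels. Cohen's kappa is
# used here as an illustrative chance-corrected metric; the labels are invented.

from collections import Counter

def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of pairings on which the two judges give the same label."""
    assert len(labels_a) == len(labels_b)
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)

def cohen_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two judges."""
    n = len(labels_a)
    p_o = percent_agreement(labels_a, labels_b)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((count_a[c] / n) * (count_b[c] / n) for c in set(labels_a) | set(labels_b))
    return (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0

# Example: human vs. GPT-4 judgements on five pairings (invented labels).
human = ["win", "draw", "loss", "win", "win"]
gpt4  = ["win", "loss", "loss", "win", "draw"]
print(percent_agreement(human, gpt4), cohen_kappa(human, gpt4))
```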
4. Analysis and Empirical Findings
LLMEval produces several key findings relevant for LLM evaluation protocol design:
- Criterion Discriminability: "Informativeness" and "accuracy" best distinguish models; "harmlessness" provides little differentiation.
- Task Sensitivity: Conversational, math, and reasoning tasks most effectively reveal capability differences.
- Evaluator Type: Onsite annotation is the most reliable.
- Auto-Evaluation Alignment: GPT-4 aligns better with humans on star scoring than in pairwise schemes, but shows a bias toward verbose responses.
- Subjectivity: Agreement between human and GPT-4 judges is lower for subjective questions than objective ones.
- Scoring Biases: Annotators assign higher scores (by ~9.8%) when answer hints are missing (i.e., models must reason without guidance).
- Ranking System Fluctuations: Elo ranking is unstable and sensitive to comparison order, even after 100,000 matches, in contrast to robust point-based methods.
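The order sensitivity noted in the last point can be illustrated with a small simulation that replays the same set of comparisons in different random orders: point totals are unchanged, while the final Elo ratings shift. This is a sketch on synthetic data, not a reproduction of the paper's experiment.

```python
# Small simulation illustrating Elo's sensitivity to match order: the same set
# of comparisons replayed in shuffled order gives different final ratings,
# while point totals do not change. Synthetic data, not the paper's experiment.

import random

def run_elo(matches, k=32.0):
    ratings = {}
    for a, b, s_a in matches:  # s_a: 1 win, 0.5 draw, 0 loss for model a
        ra, rb = ratings.get(a, 1000.0), ratings.get(b, 1000.0)
        e_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
        ratings[a] = ra + k * (s_a - e_a)
        ratings[b] = rb - k * (s_a - e_a)
    return ratings

def run_points(matches):
    points = {}
    for a, b, s_a in matches:
        points[a] = points.get(a, 0.0) + s_a
        points[b] = points.get(b, 0.0) + (1.0 - s_a)
    return points

# Synthetic comparisons among three models (invented outcomes).
matches = [("m1", "m2", 1.0), ("m2", "m3", 0.5), ("m1", "m3", 0.0)] * 200

for seed in (0, 1):
    shuffled = matches[:]
    random.Random(seed).shuffle(shuffled)
    print("Elo:", {m: round(r, 1) for m, r in run_elo(shuffled).items()},
          "Points:", run_points(shuffled))
```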
5. Public Availability and Resources
The entirety of LLMEval—including raw responses, all annotation data, evaluation scripts, prompt templates, scoring GUIs, and ranking routines—is publicly released at https://github.com/LLMeval. This transparency encourages re-use and extension by the research community, as well as benchmarking new models on identical criteria.
| Resource | Description |
|---|---|
| Annotations | Stars, pairwise judgements, full cohort metadata |
| Scripts | Rank computation, scorer templates |
| GUI | Scoring web interface, prompt viewers |
6. Methodological Implications and Future Directions
LLMEval’s comprehensive data and analysis reveal fundamental design principles for future LLM evaluation datasets and protocols:
- Multiple Criteria, Multi-Cohort Judging: Single-metric or single-rater evaluations miss key nuances in model performance.
- Ranking System Selection: Points-based aggregation is both more robust and interpretable than Elo methods under annotator and data uncertainty. Elo ranking, while widely adopted in some prior work, is shown to be highly sensitive to sample order and noise in this context.
- Automated Judging Bias: Use of automatic scoring (via LLMs like GPT-4) is cost-effective and reasonably reliable for objective tasks, but introduces response length bias and lower agreement for subjective questions.
- Evaluation Protocol Transparency: Open sourcing of both data and evaluation pipeline is essential for community-wide progress and reproducibility.
Ten formalized "conclusions" in the dataset's primary publication (Zhang et al., 2023) establish a blueprint for subsequent LLM evaluation efforts, including the careful design of tasks, evaluator protocols, and the calibration and reporting of evaluation metrics.
7. Context within the LLM Evaluation Landscape
LLMEval represents one of the first large-scale, multidimensional approaches to the "how to evaluate" problem for LLMs, explicitly distinguishing itself from benchmarks that focus solely on task coverage ("what to evaluate") or domain ("where to evaluate"). Its legacy is evident in the widespread adoption of its multi-criteria, multi-cohort, and statistical-comparative principles in later LLM benchmarks, which often cite the LLMEval framework as a methodological precedent.
LLMEval’s ongoing influence is further supported by comparative analyses—e.g., benchmarking against Elo-based evaluations, star scoring harmonization with LLM-based (GPT-4) evaluators, and cross-cohort validation—that elucidate real-world trade-offs and vulnerabilities in LLM assessment. The dataset and its methodology thus continue to inform both the practice and theory of LLM benchmarking and research.