FollowEval Benchmark: Instruction Adherence for LLMs
- FollowEval is a multi-dimensional benchmark that rigorously evaluates LLM instruction-following via hand-crafted, rule-verified test cases in English and Chinese.
- It employs deterministic scoring using regex and JSON-schema validations to assess diverse constraints like string manipulation and logical reasoning.
- Experimental results reveal significant reliability gaps, highlighting LLM failures in strict format adherence and the interpretation of multi-constraint instructions.
FollowEval is a multi-dimensional benchmark suite designed to rigorously assess the instruction-following capabilities of LLMs. Its evaluation protocol centers on tasks that require high-fidelity adherence to compositional, multi-faceted constraints in both general text generation and API-style function-calling scenarios. In contrast to previous approaches that focus on broad semantic correctness or loose type constraints, FollowEval enforces exact formal requirements by leveraging hand-authored or schema-embedded verifiable rules, yielding deterministic, fully algorithmic scoring. Notably, the benchmark includes instances in both English and Chinese, crafted by domain experts, and is engineered to expose the brittleness and limitations of contemporary LLMs in instruction interpretation, multi-constraint satisfaction, and format compliance across diverse domains.
1. Motivation and Conceptual Foundations
FollowEval addresses fundamental limitations in prior instruction-following evaluation methodologies. Earlier benchmarks, such as BFCL, τ²-Bench, and ACEBench, primarily emphasized function selection and basic argument type correctness. These approaches inadequately capture the practical rigor demanded by real-world applications, where APIs impose strict formatting constraints—such as ISO 8601 dates, quoting requirements, or exclusion of punctuation—that, if not satisfied, result in system-level failures or unhandled exceptions. FollowEval's design principles explicitly target this gap by introducing instruction-following tasks with compositional constraints and formally verifiable output requirements. This systematic focus reflects the imperative in both agent-based and general text-generation use cases for models to reliably interpret and execute multi-dimensional instructions.
2. Benchmark Construction and Multi-Dimensional Design
The original FollowEval benchmark (Jing et al., 2023) comprises 200 hand-curated instances, with equal distribution across English and Chinese. Test-case curation follows a rigorous, multi-stage expert process:
- Initial instruction drafting by six domain experts.
- Independent verification and answerability assessment by two reviewers.
- Construction of automated correctness checks (regular expressions) by regex specialists.
Each test case is tagged with at least two critical “essential elements” (dimensions):
- String Manipulation: Operations on literal sequences—e.g., substring extraction, character insertion, deletion, and replacement.
- Commonsense Reasoning: Tasks requiring single-step inference from world knowledge.
- Logical Reasoning: Symbolic or numerical reasoning, e.g., counting or comparisons.
- Spatial Reasoning: Geometric or positional manipulation of textual content.
- Response Constraints: Explicit requirements on output, such as length, formality, or character set membership.
A response is marked incorrect unless it satisfies all required elements simultaneously, enforcing compound instruction adherence. Each task is further verified to admit an unambiguous, regex-checkable correct output.
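To make the compound-scoring rule concrete, the following minimal sketch (illustrative Python, not the benchmark's released harness) tags a hypothetical test case with two essential elements and accepts a response only when every associated regex check passes:

```python
import re
from dataclasses import dataclass, field


@dataclass
class TestCase:
    """One FollowEval-style instance tagged with its essential elements."""
    instruction: str
    elements: list[str]                      # e.g. ["string_manipulation", "response_constraints"]
    checks: list[re.Pattern] = field(default_factory=list)  # one verifier per essential element


def is_correct(case: TestCase, response: str) -> bool:
    """Strict compound scoring: every element's check must pass; no partial credit."""
    return all(p.fullmatch(response) for p in case.checks)


# Hypothetical instance: reverse "apple" and answer in lowercase letters only.
case = TestCase(
    instruction='Reverse the string "apple" and reply using lowercase letters only.',
    elements=["string_manipulation", "response_constraints"],
    checks=[re.compile(r"elppa"), re.compile(r"[a-z]+")],
)

print(is_correct(case, "elppa"))   # True: both elements satisfied
print(is_correct(case, "Elppa."))  # False: casing and punctuation violate the constraints
```

Because `all()` requires every check to succeed, a single violated element marks the whole response incorrect, mirroring the benchmark's no-partial-credit rule.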
3. Schema Encoding and Evaluative Protocol in Function Calling
FollowEval's function-calling benchmark variant (IFEval-FC) (Skripko, 22 Sep 2025) systematizes strict format adherence by embedding verifiable instructions directly in JSON-schema style parameter descriptions. Each function instance defines at least one "free-form" string parameter for which the model must synthesize an argument that conforms precisely to a natural-language described constraint. For example:
```json
{
"name": "createEvent",
"parameters": {
"type": "object",
"properties": {
"event_name": {
"type": "string",
"description": "Must be uppercase letters only, no punctuation."
},
"start_date": {
"type": "string",
"description": "Date must use ISO 8601 format YYYY-MM-DD."
}
},
"required": ["event_name", "start_date"]
}
}
```
Description strings are mapped to formal regex patterns, such as:
- Punctuation-free: `^[^\p{Punct}]+$`
- ISO 8601 date: `^\d{4}-\d{2}-\d{2}$`
Each model output is evaluated by extracting the relevant parameter and applying the corresponding validator.
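A minimal validation sketch, assuming the common `{"name": ..., "arguments": ...}` function-call output shape; the extraction logic and the exact validator patterns are illustrative rather than IFEval-FC's released code:

```python
import json
import re

# Illustrative regex validators corresponding to the parameter descriptions above.
VALIDATORS = {
    "event_name": re.compile(r"[A-Z]+"),             # uppercase letters only, which also excludes punctuation
    "start_date": re.compile(r"\d{4}-\d{2}-\d{2}"),  # ISO 8601 calendar date
}


def score_call(raw_call: str) -> bool:
    """Extract the arguments of a model's function call and apply each parameter's validator."""
    args = json.loads(raw_call)["arguments"]
    return all(pattern.fullmatch(str(args.get(name, ""))) is not None
               for name, pattern in VALIDATORS.items())


good = '{"name": "createEvent", "arguments": {"event_name": "LAUNCH", "start_date": "2025-03-14"}}'
bad = '{"name": "createEvent", "arguments": {"event_name": "Launch Party!", "start_date": "14/03/2025"}}'
print(score_call(good))  # True
print(score_call(bad))   # False: casing, punctuation, and date format are all violated
```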
4. Task Domains, Instruction Types, and Test Case Generation
IFEval-FC expands the evaluation to 80 real-world domains and approximately 150 distinct function definitions, balanced across 19 instruction types. These types span:
- Keyword presence and frequency.
- Letter-frequency constraints.
- Format requirements (e.g., valid JSON, Python list syntax).
- Alphabet restrictions (Cyrillic/Greek only).
- Quotation/enclosure.
- Sentence or word counts.
- Casing (title-case, uppercase).
Each function is matched with five natural-language user requests, resulting in 750 test cases. Generation and balancing ensure each instruction type is uniformly represented. A verification pipeline eliminates ill-posed or trivially solvable items, confirmed via an ensemble of existing LLMs under a "force-call" regime. Final human review ensures that every instruction admits a unique correct parse.
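The sketch below shows how a few of the 19 instruction types could be realized as deterministic validators; the function names and exact rules are illustrative assumptions, not the benchmark's own implementations:

```python
import json
import re


def keyword_frequency(text: str, keyword: str, count: int) -> bool:
    """The keyword must occur exactly `count` times (case-insensitive, whole-word)."""
    return len(re.findall(rf"\b{re.escape(keyword)}\b", text, flags=re.IGNORECASE)) == count


def valid_json(text: str) -> bool:
    """The argument must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def cyrillic_only(text: str) -> bool:
    """All characters must be Cyrillic letters or whitespace."""
    return bool(re.fullmatch(r"[\u0400-\u04FF\s]+", text))


def title_case(text: str) -> bool:
    """Every word must begin with an uppercase letter."""
    return all(word[0].isupper() for word in text.split())


print(keyword_frequency("book a table, then confirm the table", "table", 2))  # True
print(valid_json('{"guests": 4}'))                                            # True
print(cyrillic_only("Добрый вечер"))                                          # True
print(title_case("Quarterly Planning Review"))                                # True
```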
5. Scoring Methodology and Metrics
FollowEval employs deterministic, strict-match scoring with no partial credit. The core metric is accuracy, computed as

$$\text{Accuracy} = \frac{\#\{\text{responses satisfying every required constraint}\}}{\#\{\text{test cases}\}},$$
where model responses are validated via regex or schema-based rules. In IFEval-FC, per-instruction-type precision and recall follow the standard definitions, restricted to test cases tagged with instruction type $t$:

$$\text{Precision}_t = \frac{TP_t}{TP_t + FP_t}, \qquad \text{Recall}_t = \frac{TP_t}{TP_t + FN_t}.$$
No human or LLM judges are involved; results are fully algorithmic and reproducible.
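A compact sketch of this scoring loop, assuming each graded test case has been reduced to an instruction type and a boolean pass flag (the record layout is hypothetical):

```python
from collections import defaultdict


def score(results: list[dict]) -> tuple[float, dict[str, float]]:
    """
    Strict-match scoring with no partial credit.
    Each record is {"type": instruction_type, "passed": bool}, where `passed`
    means every validator attached to the test case succeeded.
    Returns overall accuracy and a per-instruction-type breakdown.
    """
    overall = sum(r["passed"] for r in results) / len(results)
    by_type: dict[str, list[bool]] = defaultdict(list)
    for r in results:
        by_type[r["type"]].append(r["passed"])
    per_type = {t: sum(flags) / len(flags) for t, flags in by_type.items()}
    return overall, per_type


results = [
    {"type": "keyword_frequency", "passed": True},
    {"type": "keyword_frequency", "passed": False},
    {"type": "valid_json", "passed": True},
    {"type": "casing", "passed": False},
]
overall, per_type = score(results)
print(overall)   # 0.5
print(per_type)  # {'keyword_frequency': 0.5, 'valid_json': 1.0, 'casing': 0.0}
```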
6. Experimental Results and Comparative Analysis
Both the original and function-calling variants of FollowEval reveal pronounced failure modes in current LLMs:
- General Instruction-Following (200 tasks):
  - Human baseline: 1.000 accuracy in both languages.
  - GPT-4: Overall 0.775; Chinese 0.81; English 0.74.
  - GPT-3.5-Turbo: 0.650 overall.
  - Open-source models (Qwen, Baichuan) score substantially lower, with performance strongly correlating with parameter count.
  - Language bias: Models achieve marginally higher performance in Chinese, attributed to training data composition.
- Function Calling (IFEval-FC, 750 tasks):
  - GPT-5 minimal: ≈72.4% overall.
  - Claude 4.1 Opus: ≈67.8%.
  - GPT-4o: ≈65.2%.
  - Earlier model baselines (GPT-4.1, Haiku series) lag by 10–30 points.
  - Most instruction types fall well below 80% consistency, except for simple keyword presence.
Failure modes include:
- Frequent violation of JSON format constraints (>60% miss rate except Opus).
- Wrapping/quoting requirements achieved 30–50% success.
- Alphabet constraints (Cyrillic/Greek only) <50%.
- Keyword-frequency instructions systematically mishandled (40–60%).
Proprietary models outperform open-source variants, and accuracy increases with size.
| Model | Chinese AVG | English AVG | Overall AVG |
|---|---|---|---|
| Human Baseline | 1.000 | 1.000 | 1.000 |
| GPT-4 | 0.81 | 0.74 | 0.775 |
| GPT-3.5-Turbo | 0.64 | 0.66 | 0.650 |
| Qwen-14B-Chat | 0.62 | 0.43 | 0.525 |
| Qwen-7B-Chat | 0.55 | 0.36 | 0.448 |
| Baichuan-2-13B | 0.43 | 0.39 | 0.408 |
No statistical significance tests were reported in either benchmark.
7. Implications, Current Limitations, and Future Directions
FollowEval exposes a substantial and persistent gap between human-level and state-of-the-art machine performance in both general multi-constraint instruction following and precise function argument formatting. This has significant ramifications for safety and reliability in agent-based deployments, where minute deviations—such as a misplaced quote or invalid date—can induce severe downstream failures. Key limitations include:
- Most current benchmarks emphasize single instructions or single-function scenarios; multi-function selection and cross-field dependencies remain underexplored.
- Language coverage is presently limited, especially in function calling cases.
- Compound constraints across multiple parameters are not yet incorporated.
Proposed future directions for FollowEval include expanding to multi-function, multilingual benchmarks, introducing more complex structural constraints (nested JSON, YAML, CSV), and investigating prompting or fine-tuning strategies to enhance LLM robustness in parsing and executing formalized constraints.
A plausible implication is that the continued reliance on purely token-level or semantic evaluation frameworks will be inadequate for high-assurance agent systems in production contexts. The algorithmic, compositional approach embodied by FollowEval is positioned as necessary for driving meaningful progress in reliable instruction-following capabilities.