Instruction Taxonomy: A Comprehensive Overview
- Instruction taxonomy is a formal classification system that decomposes user instructions into distinct, actionable categories.
- It supports practical applications in natural language generation, multilingual evaluation, and curriculum learning through structured benchmarks.
- Empirical studies show that taxonomy-driven refinements improve metrics like ROUGE-L F1 and instruction-following accuracy across diverse domains.
Instruction taxonomy refers to formal classification systems that decompose, structure, and label the space of user instructions—especially as presented to machine learning models and software systems—into distinct, theoretically and empirically motivated categories. Instruction taxonomies serve as foundational tools for research in natural language generation (NLG), LLM alignment, multilingual model evaluation, curriculum learning, entity set expansion, engineering benchmarks, classroom pedagogy, and multimodal learning. Across these domains, taxonomies enable precise benchmarking, systematic dataset construction, controlled instruction tuning, and principled error analysis. The following sections synthesize contemporary instruction taxonomy research, centering on NLG ambiguity (Niwa et al., 2024), constraint and response type (Li et al., 10 Mar 2025, Jin et al., 1 May 2026), interaction and dependency (Zhao et al., 2024), domain coverage (Li et al., 2024, Naser et al., 16 Feb 2026), and pedagogical analysis (Karjanto et al., 2016, Zhou et al., 6 Mar 2026).
1. Taxonomy Design in Natural Language Generation: Ambiguity Dimensions
AmbigNLG introduces a six-category taxonomy to formally capture and mitigate ambiguous or under-specified user instructions in NLG tasks (Niwa et al., 2024). Each ambiguity type is defined by concrete semantic gaps that can impede LLM performance:
| Category | Definition | Clarification Template |
|---|---|---|
| Context | Lacks background or situational information | "Additional context: ___" |
| Keywords | Unspecified domain terms to include/exclude | "Include ___ in your response." |
| Length | Missing output length constraints | "Answer with ___ words." |
| Planning | Absent structure or output outline | "Generate output based on outline: 1. ___ 2. ___ " |
| Style | Unspecified rhetorical style or tone | "Write in a ___ style." |
| Theme | Unclear focus or sub-topic within task | "Primarily discuss the theme: ___." |
Formally, ambiguity arises whenever Y(I) ⊋ Y*(I), i.e., the set of valid outputs Y(I) for an instruction I strictly over-approximates the set Y*(I) matching the user's actual intent. The taxonomy supports a template-based compositional refinement protocol, where ambiguous base instructions are expanded using slot-filling of category-specific prompts. Empirically, integrating this taxonomy-driven refinement increases ROUGE-L F1 by up to 15 points and enhances instruction-following by ≈0.7 on a [0,1] scale—even for non-frontier LLMs—establishing the functional utility of explicit ambiguity mitigation.
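The slot-filling refinement protocol can be sketched in a few lines. This is a minimal illustration using the six categories and clarification templates from the table above; the `refine_instruction` helper and the example slot values are assumptions for illustration, not code from the paper.

```python
# Minimal sketch of AmbigNLG-style template refinement: ambiguity
# categories map to clarification templates, and slot-filled sentences
# are appended to an ambiguous base instruction.

CLARIFICATION_TEMPLATES = {
    "Context":  "Additional context: {}",
    "Keywords": "Include {} in your response.",
    "Length":   "Answer with {} words.",
    "Planning": "Generate output based on outline: {}",
    "Style":    "Write in a {} style.",
    "Theme":    "Primarily discuss the theme: {}.",
}

def refine_instruction(base: str, clarifications: dict) -> str:
    """Append category-specific clarification sentences to a base instruction."""
    extra = [
        CLARIFICATION_TEMPLATES[cat].format(value)
        for cat, value in clarifications.items()
        if cat in CLARIFICATION_TEMPLATES
    ]
    return " ".join([base] + extra)

refined = refine_instruction(
    "Summarize the article.",
    {"Length": "50", "Style": "formal", "Theme": "economic impact"},
)
print(refined)
```

Because each clarification is an independent sentence, categories compose freely: any subset of the six ambiguity types can be resolved without rewriting the base instruction.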
2. Constraint-Based Taxonomies for Multilingual and Structured Tasks
Taxonomic classifications addressing the types of constraints embedded in instructions are critical for evaluating cross-lingual alignment and structured data retrieval:
XIFBench Five-Constraint Taxonomy (Multilingual LLMs) (Li et al., 10 Mar 2025)
XIFBench organizes constraints into five mutually exclusive categories with clear semantic boundaries:
| Category | Definition/Scope |
|---|---|
| Content | Required information, inclusions/exclusions, comparisons |
| Style | Rhetorical form, genre, tone, emotional stance |
| Situation | Contextual, role, audience, timeline, purpose |
| Format | Output organization, template, hierarchy, structure |
| Numerical | Quantitative limits, explicit counts, length/range |
Empirical results indicate that Format and Numerical categories are most robust cross-lingually, while Style and Situation degrade substantially in lower-resource languages.
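Category-level robustness findings like these presuppose per-category scoring. The following sketch shows one way to compute constraint-following accuracy broken down by the five XIFBench categories; the `Constraint` record and the scoring helper are illustrative assumptions, not the benchmark's actual implementation.

```python
# Hypothetical per-category constraint accounting in the spirit of
# XIFBench: each constraint is tagged with one of the five categories
# and judged satisfied/unsatisfied per model response.
from collections import defaultdict
from dataclasses import dataclass

CATEGORIES = {"Content", "Style", "Situation", "Format", "Numerical"}

@dataclass
class Constraint:
    category: str      # one of the five XIFBench categories
    satisfied: bool    # judged per response (e.g., by an LLM judge)

def per_category_accuracy(constraints):
    """Fraction of satisfied constraints, broken down by category."""
    hits, totals = defaultdict(int), defaultdict(int)
    for c in constraints:
        assert c.category in CATEGORIES, f"unknown category: {c.category}"
        totals[c.category] += 1
        hits[c.category] += c.satisfied
    return {cat: hits[cat] / totals[cat] for cat in totals}

scores = per_category_accuracy([
    Constraint("Format", True),
    Constraint("Format", True),
    Constraint("Style", False),
    Constraint("Numerical", True),
])
print(scores)
```

Aggregating per category rather than per instruction is what makes cross-lingual comparisons like "Format holds up, Style degrades" measurable in the first place.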
FollowTable Taxonomy (Instruction-Following Table Retrieval) (Jin et al., 1 May 2026)
FollowTable introduces a hierarchical taxonomy specifically tailored for structured data retrieval:
When evaluating retrieval methods, the taxonomy's categories define the axes along which instruction-following is measured (via an Instruction Responsiveness Score).
3. Hierarchical and Domain Taxonomies: Knowledge, Skills, and Coverage
Taxonomies are frequently leveraged to guarantee broad disciplinary and cognitive coverage:
GLAN Taxonomy (Generalized Instruction Tuning) (Li et al., 2024)
GLAN implements a five-level tree spanning:
- Fields (e.g., Natural Sciences, Humanities)
- Sub-Fields (e.g., Mathematics)
- Disciplines (e.g., Algebra)
- Subjects (e.g., Calculus)
- Class Sessions (e.g., Derivatives)
Nodes at each level are algorithmically expanded using LLMs, with each leaf mapped to key concepts (atomic units of instruction). This structure underpins curriculum-scale benchmarks and fully synthetic instruction data for generalized instruction following and skill learning.
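The top-down expansion can be sketched as a recursive descent over the five levels. The level names and the Natural Sciences → Mathematics → Algebra path come from the text above; `propose_children` is a hypothetical stub standing in for the LLM call that GLAN uses to generate child nodes.

```python
# Sketch of GLAN-style top-down taxonomy expansion: each node is
# expanded one level at a time until the leaf (Class Session) level,
# where leaves would later be mapped to key concepts.

LEVELS = ["Field", "Sub-Field", "Discipline", "Subject", "Class Session"]

def propose_children(node: str, level: str) -> list[str]:
    """Placeholder for an LLM prompt like 'List the {level}s under {node}'."""
    stub = {  # illustrative canned responses in place of a real LLM
        ("Natural Sciences", "Sub-Field"): ["Mathematics", "Physics"],
        ("Mathematics", "Discipline"): ["Algebra", "Analysis"],
    }
    return stub.get((node, level), [])

def expand(node: str, depth: int = 0) -> dict:
    """Recursively expand a node; leaves carry an empty child list."""
    if depth + 1 >= len(LEVELS):
        return {node: []}   # leaf: later mapped to key concepts
    children = propose_children(node, LEVELS[depth + 1])
    return {node: [expand(c, depth + 1) for c in children]}

tree = expand("Natural Sciences")
```

The recursion makes the curriculum structure explicit: the tree depth is fixed by the five levels, while breadth at each level is delegated to the generator.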
ERI Taxonomy (Engineering Reasoning and Instruction) (Naser et al., 16 Feb 2026)
ERI structures tasks as a four-way cross-product F × D × I × R:
- F: engineering fields (civil, mechanical, etc.)
- D: subdomains (e.g., Statics, Thermodynamics)
- I: intent types (Definition, Explanation, Calculation, Comparison, Design/Synthesis, Troubleshooting, Code)
- R: rigor tiers (undergraduate, graduate, professional)
This taxonomy enables highly granular, slice-based evaluation and tuning of engineering-capable LLMs, supporting fine control over model routing, curriculum design, and regression testing.
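Because the taxonomy is a cross-product, the full set of evaluation slices can be enumerated mechanically. A minimal sketch, with abbreviated example value lists (the full axes are larger than shown):

```python
# Enumerating ERI-style capability slices as the cross-product of the
# four axes: fields x subdomains x intent types x rigor tiers.
from itertools import product

fields      = ["civil", "mechanical"]
subdomains  = ["Statics", "Thermodynamics"]
intents     = ["Definition", "Calculation", "Design/Synthesis"]
rigor_tiers = ["undergraduate", "graduate", "professional"]

slices = list(product(fields, subdomains, intents, rigor_tiers))
print(len(slices))  # 2 * 2 * 3 * 3 = 36 evaluable capability slices
```

Each tuple names one slice for targeted evaluation or regression testing, e.g. ("civil", "Statics", "Calculation", "undergraduate"); model routing can key off the same tuples.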
4. Instruction Interaction and Dependency Taxonomies
Beyond static classification, modern taxonomies capture not only which instruction categories exist but also how they interact and relate as pre-requisites for skill acquisition:
- Three-tier Dependency Hierarchy: (Zhao et al., 2024) proposes a taxonomy over 29 categories arranged as preliminary (root) skills (e.g., Math Reasoning, Coding), intermediary reasoning/understanding tasks, and subsequent complex synthesis tasks (such as Creative Writing). Categories are linked by a DAG encoding directional dependencies, empirically established via ablation and performance testing.
- Interaction Matrix M: Quantifies substitution or complementarity between categories; M_ij > 0 indicates that augmenting data in category i boosts performance in category j, guiding sample allocation and curriculum scheduling.
- Effect-Equivalence Optimization: A curriculum schema and linear programming are applied to optimize category proportions and training order (preliminary → intermediary → subsequent), maximizing transfer and avoiding oversampling of highly substitutable categories.
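The "train prerequisites first" ordering follows directly from a topological sort of the dependency DAG. A minimal sketch in the spirit of (Zhao et al., 2024), with illustrative category names and edges (the paper's actual 29-category graph is larger):

```python
# Curriculum ordering over a skill-dependency DAG: a topological sort
# guarantees every category is scheduled after all of its prerequisites.
from graphlib import TopologicalSorter

# Maps each category to the set of prerequisite categories it depends on.
dependencies = {
    "Math Reasoning": set(),                             # preliminary (root)
    "Coding": set(),                                     # preliminary (root)
    "Logical Reasoning": {"Math Reasoning"},             # intermediary
    "Creative Writing": {"Logical Reasoning", "Coding"}, # subsequent
}

order = list(TopologicalSorter(dependencies).static_order())
print(order)  # prerequisites always precede their dependents
```

In the full pipeline this ordering constrains the curriculum schedule, while the interaction matrix and linear program decide how many samples each category receives within that order.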
5. Task-Specific and Operational Taxonomies
Instruction taxonomies are customized for specific workflows and annotation practice:
- Installation Instruction Taxonomy: (Gao et al., 2023) organizes installation-related content in software documentation into Pre-installation, Installation, Post-installation, Help Information, Presentation, and External Resource links, supporting template-driven README construction and automated completeness diagnostics.
- Tutor Move Taxonomy: (Zhou et al., 6 Mar 2026) maps one-on-one tutoring utterances onto four high-level domains—Tutoring Support, Learning Support (split by student engagement: reasoning-prompting, feedback, hint/explanation/answer), Social-Emotional and Motivational Support, and Logistical Support—for scalable annotation and learning outcome analysis.
- Taxonomy-Guided Instruction for Entity Set and Taxonomy Expansion: (Shen et al., 2024) reduces expansion and construction tasks to two core actions: generating sibling nodes (co-hyponyms) and selecting parent nodes (hypernyms), supporting a unified joint instruction-tuned model for multiple graph-based tasks.
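The two core actions in the last item reduce to simple tree operations. A minimal sketch, assuming a flat parent-to-children dictionary as the taxonomy representation; the data structure and helper names are illustrative, not from (Shen et al., 2024).

```python
# The two core taxonomy-expansion actions: generating a sibling
# (co-hyponym) next to an existing node, and selecting the parent
# (hypernym) under which a node is filed.

taxonomy = {"animal": ["dog", "cat"], "plant": ["fern"]}

def add_sibling(tree: dict, existing: str, new_node: str) -> None:
    """Attach new_node as a co-hyponym alongside an existing child."""
    for children in tree.values():
        if existing in children:
            children.append(new_node)
            return
    raise KeyError(f"{existing} not found in taxonomy")

def select_parent(tree: dict, node: str) -> str:
    """Return the hypernym under which node is filed."""
    for parent, children in tree.items():
        if node in children:
            return parent
    raise KeyError(node)

add_sibling(taxonomy, "dog", "rabbit")
print(select_parent(taxonomy, "rabbit"))  # -> animal
```

In the instruction-tuned setting, an LLM performs these same two actions (propose co-hyponyms, pick a hypernym) from natural-language prompts rather than explicit tree traversal, which is what lets one model serve both entity set expansion and taxonomy construction.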
6. Application in Multimodal and Multitask Settings
Visual and multimodal models have adopted taxonomic splits for both instruction type and vision task:
- Visual Instruction Tuning Taxonomy: (Huang et al., 2023) spans: Discriminative (classification, segmentation, detection, grounding), Generative (image generation/editing), Complex Reasoning (captioning, VQA, multimodal chat), alongside video, 3D, medical, and document assistant families. Within each, four design dimensions (architecture, prompt format, objective, optimization) systematically organize the design space.
- Multilingual/Multi-Turn Instruction Taxonomies: The Evol taxonomy (Maheshwary et al., 2024) utilizes a two-stage structure: (a) instruction evolutions (generic and task-specific), and (b) multi-turn continuations (21 fine-grained dialogue operations). This supports controlled growth of multi-turn, multilingual datasets with guaranteed coverage across languages, tasks, and dialogue acts.
7. Practical Implications and Empirical Impact
Explicit taxonomic frameworks enable:
- Fine-grained evaluation and error localization by category (e.g., measuring cross-lingual drops in Style/Situation (Li et al., 10 Mar 2025), or content-vs-structure failures in retrieval (Jin et al., 1 May 2026)).
- Protocolized instruction refinement with measurable ROUGE or alignment gains (as in AmbigNLG’s 6-way taxonomy (Niwa et al., 2024)).
- Curriculum learning schema and sample allocation guided by dependency and interaction graphs for optimal downstream performance (Zhao et al., 2024).
- Automated or semi-automated annotation, completeness checking, and schema expansion using structured templates and modular node insertion (Li et al., 2024, Gao et al., 2023).
- Enhanced benchmarking and regression testing for capability slices in engineering and domain-specific tasks (Naser et al., 16 Feb 2026, Shen et al., 2024).
Taxonomy design thus underpins both theoretical advances in instruction representation and practical advances in model alignment, fine-tuning, and evaluation across domains, modalities, and languages.