Multi-label Instruction Classification
- Multi-label instruction classification is a method for assigning multiple simultaneous labels to instructional texts, addressing overlapping categories and class imbalance.
- It leverages transformer-based models like BERT and XLNet as well as label-enhanced architectures to capture complex label dependencies in procedural guides.
- Evaluations on datasets such as wikiHow and CrisisBench demonstrate its effectiveness, with strong Micro-F1 and Macro-F1 scores across diverse domains.
Multi-label instruction classification is the task of assigning multiple categorical labels to instructional or procedural text, where each instruction may be relevant to several predefined categories simultaneously. This paradigm extends classical single-label classification and is essential for knowledge base construction, task-oriented learning, and large-scale information retrieval across domains such as procedural guides (e.g., wikiHow articles) and disaster informatics.
1. Problem Definition and Data Preprocessing
Multi-label instruction classification requires mapping a free-form instructional text to a set of label categories. The target is a binary vector $\mathbf{y} \in \{0,1\}^{L}$, where $y_j = 1$ if label $j$ applies to the instruction and $0$ otherwise. The domain naturally exhibits label overlap, complex label dependencies, and severe class imbalance, particularly as the label space grows.
A representative dataset is the wikiHow “steps” corpus used in InstructNet (Aurpa et al., 20 Dec 2025), which initially contains 11,121 instructional records and over 6,000 potential categories. To mitigate extreme imbalance, only categories occurring at least 500 times are retained, yielding the final label set. Standard preprocessing comprises special-character removal, stop-word deletion, lowercasing, lemmatization, and tokenization according to the chosen transformer tokenizer.
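The preprocessing steps above can be sketched as follows; the stop-word list and the trailing-"s" lemmatization rule are illustrative stand-ins for the NLTK/spaCy components and transformer tokenizers typically used in practice:

```python
import re

# Illustrative stop-word subset; real pipelines use a full list (e.g., NLTK's).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in"}

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # special-character removal
    tokens = text.lower().split()                  # lowercasing + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word deletion
    # Crude lemmatization stand-in: strip a trailing plural "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("Whisk the eggs, then fold in the flour!"))
```

In a real pipeline the final step would hand these tokens (or the raw text) to the transformer's own subword tokenizer rather than whitespace splitting.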
In disaster informatics, the CrisisBench dataset (Yin et al., 16 Jun 2024) provides multi-label tweet annotations for event type (14 classes), informativeness, and human-aid category (16 classes), requiring simultaneous prediction of multiple dimensions for each instance.
2. Model Architectures for Multi-label Instruction Classification
2.1 Transformer-based Models
Transformer encoders, particularly XLNet (Aurpa et al., 20 Dec 2025) and BERT, have been demonstrated as effective backbones. The instructional text is tokenized to yield input IDs, attention masks, and segment IDs, processed by the encoder to produce a contextualized [CLS] vector $\mathbf{h}_{[CLS]} \in \mathbb{R}^{d}$ ($d = 768$ for both BERT Base and XLNet Base). A single fully connected head maps the [CLS] representation to $L$ logits using a weight matrix $W \in \mathbb{R}^{L \times d}$, followed by a sigmoid activation to produce independent probabilities per label:

$$\hat{\mathbf{y}} = \sigma\left(W \mathbf{h}_{[CLS]} + \mathbf{b}\right)$$

This multi-label head enables simultaneous binary predictions across all $L$ labels.
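A minimal NumPy sketch of such a sigmoid multi-label head, with illustrative dimensions ($d = 768$, $L = 80$) and random weights standing in for a trained classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 768, 80                            # hidden size and label count (illustrative)

W = rng.normal(scale=0.02, size=(L, d))   # fully connected classification weights
b = np.zeros(L)                           # bias
h_cls = rng.normal(size=d)                # contextualized [CLS] vector from the encoder

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(W @ h_cls + b)            # one independent probability per label
predicted = np.flatnonzero(probs > 0.5)   # all labels above a 0.5 threshold

print(probs.shape, predicted.shape)
```

Because each label gets its own sigmoid, any subset of labels can be predicted simultaneously, unlike a softmax head that forces a single choice.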
InstructNet (Aurpa et al., 20 Dec 2025) leverages XLNet's permutation-based language modeling and two-stream attention to better capture long-range dependencies and bidirectional context in procedural text compared to BERT's masked LM objective. Architecture parameters for the Base configuration include $12$ transformer blocks, $12$ attention heads, and hidden size $d = 768$.
2.2 Label-Enhanced Feedforward Models
The “labels as hidden nodes” framework (Read et al., 2015) augments each feedforward hidden layer by concatenating linear projections of the true label vector into each hidden state:

$$\mathbf{h}^{(k)} = \phi\left(W^{(k)}\left[\mathbf{h}^{(k-1)};\, V^{(k)}\mathbf{y}\right]\right)$$

where $V^{(k)}$ linearly projects the label vector $\mathbf{y}$ and $\phi$ is a non-linearity (e.g., ReLU). The output is a sigmoid-mapped vector approximating $p(\mathbf{y} \mid \mathbf{x})$. This approach implicitly models co-occurrence and dependency structures across labels without explicit dependency modeling or classifier chaining, and scales efficiently with the number of labels $L$.
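Assuming a single hidden layer, a ReLU non-linearity, and illustrative shapes, the label-feedback forward pass can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_proj, d_h, L = 32, 8, 16, 8                    # illustrative dimensions

V1 = rng.normal(scale=0.1, size=(d_proj, L))           # label-feedback projection
W1 = rng.normal(scale=0.1, size=(d_h, d_in + d_proj))  # hidden layer over [x; V1 y]
W_out = rng.normal(scale=0.1, size=(L, d_h))           # sigmoid output layer

def forward(x, y_feedback):
    # Concatenate the input with a linear projection of the label vector.
    z = np.concatenate([x, V1 @ y_feedback])
    h = np.maximum(0.0, W1 @ z)                        # ReLU hidden state
    return 1.0 / (1.0 + np.exp(-(W_out @ h)))          # per-label sigmoid

x = rng.normal(size=d_in)
y_true = rng.integers(0, 2, size=L).astype(float)
p_train = forward(x, y_true)        # training: true labels fed back
p_test = forward(x, np.zeros(L))    # inference: labels unknown, zeroed out
print(p_train.shape, p_test.shape)
```

At inference the true labels are unavailable, so the feedback channel is neutralized (here zeroed); the network has nonetheless absorbed label co-occurrence during training.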
2.3 Instruction-tuned LLMs
CrisisSense-LLM (Yin et al., 16 Jun 2024) demonstrates instruction fine-tuning of LLaMA 2 for multi-label classification. Task formulation involves construction of instruction-oriented prompts that elicit multiple categorical outputs (e.g., event, informativeness, aid-type), formatted as JSON. Parameter-efficient tuning (LoRA) and full-parameter finetuning are supported. Multi-label prediction is operationalized as a conditional language modeling task over concatenated instructions and expected JSON outputs.
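A sketch of this prompt/JSON formulation; the template wording and the field names (`event`, `informativeness`, `aid_type`) are illustrative assumptions, not the exact CrisisSense-LLM template:

```python
import json

# Hypothetical instruction-oriented prompt template eliciting three
# categorical outputs as a single JSON object.
PROMPT_TEMPLATE = (
    "Classify the following disaster-related tweet. "
    "Respond with JSON containing the keys event, informativeness, aid_type.\n"
    "Tweet: {tweet}\nAnswer:"
)

def build_prompt(tweet: str) -> str:
    return PROMPT_TEMPLATE.format(tweet=tweet)

def parse_response(generated: str) -> dict:
    # The fine-tuned model is trained to emit JSON after the prompt;
    # parsing it back recovers the multi-label prediction.
    return json.loads(generated)

prompt = build_prompt("Bridge on Route 9 collapsed, cars trapped")
fake_generation = (
    '{"event": "infrastructure_damage", '
    '"informativeness": "informative", "aid_type": "rescue"}'
)
labels = parse_response(fake_generation)
print(sorted(labels))
```

During fine-tuning, the loss is ordinary next-token cross-entropy over the concatenation of such prompts and their expected JSON completions.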
3. Training Objective and Optimization
The prevalent objective in explicit multi-label supervised setups is binary cross-entropy, averaged over all labels and instances:

$$\mathcal{L} = -\frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left[\, y_{ij} \log \hat{y}_{ij} + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}\right) \right]$$
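The binary cross-entropy objective can be computed directly in NumPy; the toy label matrix $Y$ and probability matrix $P$, both of shape $(N, L)$, are illustrative:

```python
import numpy as np

def multilabel_bce(Y, P, eps=1e-12):
    # Clip probabilities for numerical stability, then average the
    # per-label binary cross-entropy over all instances and labels.
    P = np.clip(P, eps, 1.0 - eps)
    return float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

Y = np.array([[1, 0, 1], [0, 0, 1]], dtype=float)     # ground-truth labels
P = np.array([[0.9, 0.2, 0.8], [0.1, 0.3, 0.7]], dtype=float)  # sigmoid outputs
print(round(multilabel_bce(Y, P), 4))
```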
For label-augmented neural approaches (Read et al., 2015), an additional regularization term on the weight and label-projection matrices is used to constrain model capacity.
Optimization is typically performed using AdamW with warmup and learning rate decay schedules. Default batch sizes (e.g., 48–64), sequence truncation limits (e.g., 512 tokens), and regularization are tuned per architecture and dataset (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024). In LLM-based approaches, more sophisticated schedules and checkpointing accompany parameter-efficient adaptation.
4. Performance Evaluation and Benchmarks
Evaluation follows the established metrics for multi-label tasks:
- Accuracy: the proportion of correct predictions (per-label or exact-match, depending on the benchmark)
- Micro-F1: computed globally over all predictions
- Macro-F1: averaged across all labels, considering per-label precision and recall
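A sketch of micro- versus macro-F1 on toy multi-label predictions, illustrating that micro-F1 pools true/false positives across all labels while macro-F1 averages per-label F1 scores:

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(Y, Yhat):
    # Per-label counts over the (N, L) ground-truth and prediction matrices.
    tp = (Y * Yhat).sum(axis=0)
    fp = ((1 - Y) * Yhat).sum(axis=0)
    fn = (Y * (1 - Yhat)).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())          # pooled counts
    macro = float(np.mean([f1(t, f, n) for t, f, n in zip(tp, fp, fn)]))
    return micro, macro

Y    = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
Yhat = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]], dtype=float)
micro, macro = micro_macro_f1(Y, Yhat)
print(round(micro, 3), round(macro, 3))
```

Because macro-F1 weights every label equally, it is more sensitive than micro-F1 to poor performance on rare, long-tail labels.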
On wikiHow, InstructNet (XLNet) consistently outperforms a BERT baseline in accuracy, micro-F1, and macro-F1 (Aurpa et al., 20 Dec 2025). An ablation removing the multi-label head substantially reduces accuracy, confirming the necessity of independent multi-label outputs.
CrisisSense-LLM attains strong overall accuracy for disaster-tweet classification with full-parameter tuning, improved further by checkpoint ensembling. Task-wise, event and informativeness accuracies are highest, while aid-type, reflecting a longer-tailed label distribution, achieves lower accuracies starting at roughly $55\%$ (Yin et al., 16 Jun 2024). Zero-shot LLaMA 2-Chat performs substantially worse, underscoring the impact of task-specific instruction tuning.
Label-enhanced FFN approaches (“labels as hidden nodes”) outperform traditional binary relevance and classifier chain baselines under exact match metrics, while exhibiting linear scalability in the number of labels $L$ (Read et al., 2015).
5. Modeling Label Dependencies and Label Structure
Explicit modeling of label dependencies is historically seen as crucial in multi-label classification. However, the analysis in (Read et al., 2015) establishes that label dependency is often a consequence of base model limitations rather than inherent to the data. Feeding the label vector as input to middle layers allows the model to capture global co-occurrence patterns and reduces the need for explicitly learned dependency structures or combinatorial inference over label subsets.
InstructNet (Aurpa et al., 20 Dec 2025) does not replicate classifier-chaining, but refers to future directions involving graph-based or co-attentive label modeling, hierarchical architectures, and conditional label embeddings for more refined dependency exploitation. A plausible implication is that such techniques could further improve F1, particularly in settings with high label correlation or when extending to large, open taxonomies.
6. Current Limitations and Future Directions
Both InstructNet and CrisisSense-LLM outline open challenges:
- Label Imbalance and Sparsity: Simple frequency filtering (e.g., the 500-occurrence label cutoff) still leaves long-tail label frequencies and under-represented classes (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024).
- Label Structure Modeling: Proposals include adopting graph-based dependency encoders, hierarchical classification, and label embedding strategies (Aurpa et al., 20 Dec 2025).
- Instruction/Prompt Sensitivity in LLMs: Instruction-tuned LLMs are sensitive to prompt format; performance drops by $14\%$ or more if the prompt at inference diverges from the training template (Yin et al., 16 Jun 2024).
- Scaling to Longer Inputs: Handling full instructional articles or multi-step procedures may require sparse attention architectures such as Longformer (Aurpa et al., 20 Dec 2025).
- Threshold Calibration and Metric Extensions: Thresholding per-label, optimizing Fβ, and extending evaluation to detailed micro/macro metrics are proposed to better accommodate skewed label distributions (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024).
- Catastrophic Forgetting: Particularly in PEFT, models may lose the ability to generate valid outputs after extended training epochs (Yin et al., 16 Jun 2024).
- Multi-modal Integration: For richer context (e.g., disaster informatics), future work includes the addition of image and location features to the classification pipeline (Yin et al., 16 Jun 2024).
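The per-label threshold calibration proposed above can be sketched as a simple grid search on held-out validation data; the probability values and threshold grid below are illustrative:

```python
import numpy as np

def best_threshold(y, p, grid=np.arange(1, 20) / 20):
    # Pick the threshold in the grid maximizing F1 for this single label,
    # using validation labels y and predicted probabilities p.
    def f1_at(t):
        yhat = (p >= t).astype(float)
        tp = (y * yhat).sum()
        fp = ((1 - y) * yhat).sum()
        fn = (y * (1 - yhat)).sum()
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
    return max(grid, key=f1_at)

# Toy validation data for one label: positives cluster above ~0.55.
y = np.array([1, 1, 0, 0, 1], dtype=float)
p = np.array([0.9, 0.6, 0.4, 0.1, 0.55])
t = best_threshold(y, p)
print(t)
```

Repeating this per label replaces the uniform 0.5 cutoff with label-specific operating points, which particularly helps rare labels whose calibrated probabilities rarely exceed 0.5.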
7. Broader Applications and Generalization
Multi-label instruction classification methods are domain-agnostic. Instruction-based LLM fine-tuning, transformer-based multi-label heads, and label-augmented neural models have demonstrated transferability to domains such as medical coding, multi-aspect sentiment, and procedural text classification in knowledge bases. Parameter-efficient fine-tuning (e.g., LoRA) enables resource-conscious adaptation of large models to new multi-label taxonomies (Yin et al., 16 Jun 2024). Table-driven instruction template approaches facilitate transparent and extensible task specification.
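The low-rank mechanism behind LoRA can be illustrated in a few lines of NumPy: a frozen weight matrix $W$ is adapted through a trainable rank-$r$ update $BA$, so far fewer parameters are trained. Shapes and rank are illustrative; practical setups use libraries such as Hugging Face PEFT:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 64, 4                    # illustrative dimensions and rank

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection
B = np.zeros((d_out, r))                      # trainable up-projection (zero init)

def adapted_forward(x):
    # Base path plus the low-rank adapter path.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning departs smoothly from the pretrained model.
assert np.allclose(adapted_forward(x), W @ x)
print(A.size + B.size, W.size)                # trainable vs. frozen parameter counts
```

Here only $r(d_{in} + d_{out}) = 512$ parameters are trained against the $4096$ frozen ones, which is what makes adapting large models to new multi-label taxonomies resource-conscious.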
A summary table of representative approaches and their core properties:
| Approach | Architecture | Core Mechanism |
|---|---|---|
| InstructNet (XLNet) | Transformer + FC Head | Sigmoid multi-label prediction; permutation LM |
| BERT/BERT-Base Baseline | Transformer + FC Head | Masked LM; sigmoid multi-label prediction |
| Label-as-Hidden-Nodes | FFN + Label Feedback | Labels as layer input; implicit dependency |
| CrisisSense-LLM | Instruction-tuned LLM (LLaMA) | Prompt-based multi-output conditioning |
The development of robust, efficient, and generalizable multi-label instruction classification systems remains a priority as procedural and multi-aspect textual information proliferates. Recent advances validate both transformer-based and label-augmented architectures and highlight the continuing need for innovations in modeling label structure, prompt robustness, and multi-domain generalization (Aurpa et al., 20 Dec 2025, Read et al., 2015, Yin et al., 16 Jun 2024).