Multi-label Instruction Classification
- Multi-label instruction classification is a method for assigning multiple simultaneous labels to instructional texts, addressing overlapping categories and class imbalance.
- It leverages transformer-based models like BERT and XLNet as well as label-enhanced architectures to capture complex label dependencies in procedural guides.
- Evaluations on datasets such as wikiHow and CrisisBench demonstrate its effectiveness, with strong Micro-F1 and Macro-F1 scores across diverse domains.
Multi-label instruction classification is the task of assigning multiple categorical labels to instructional or procedural text, where each instruction may be relevant to several predefined categories simultaneously. This paradigm extends classical single-label classification and is essential for knowledge base construction, task-oriented learning, and large-scale information retrieval across domains such as procedural guides (e.g., wikiHow articles) and disaster informatics.
1. Problem Definition and Data Preprocessing
Multi-label instruction classification requires mapping a free-form instructional text to a set of label categories. The target is a binary vector $\mathbf{y} \in \{0,1\}^{L}$, where $y_j = 1$ if label $j$ applies to the instruction and $0$ otherwise. The domain naturally exhibits label overlap, complex label dependencies, and severe class imbalance, particularly as the label space grows.
A representative dataset is the wikiHow “steps” corpus used in InstructNet (Aurpa et al., 20 Dec 2025), which initially contains 11,121 instructional records and over 6,000 potential categories. To mitigate extreme imbalance, only categories occurring at least 500 times are retained, yielding the final label set. Standard preprocessing comprises special-character removal, stop-word deletion, lowercasing, lemmatization, and tokenization according to the chosen transformer tokenizer.
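The preprocessing steps above can be sketched as follows; the stop-word list and the trailing-"s" lemmatization rule are illustrative stand-ins for the NLTK/spaCy components and transformer tokenizers typically used in practice:

```python
import re

# Illustrative stop-word subset; real pipelines use a full list (e.g., NLTK's).
STOP_WORDS = {"the", "a", "an", "to", "of", "and", "is", "in"}

def preprocess(text: str) -> list[str]:
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)   # special-character removal
    tokens = text.lower().split()                  # lowercasing + tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word deletion
    # Crude lemmatization stand-in: strip a trailing plural "s".
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("Whisk the eggs, then fold in the flour!"))
```

In a real pipeline the final step would hand these tokens (or the raw text) to the transformer's own subword tokenizer rather than whitespace splitting.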
In disaster informatics, the CrisisBench dataset (Yin et al., 16 Jun 2024) provides multi-label tweet annotations for event type (14 classes), informativeness, and human-aid category (16 classes), requiring simultaneous prediction of multiple dimensions for each instance.
2. Model Architectures for Multi-label Instruction Classification
2.1 Transformer-based Models
Transformer encoders, particularly XLNet (Aurpa et al., 20 Dec 2025) and BERT, have been demonstrated as effective backbones. The instructional text is tokenized to yield input IDs, attention masks, and segment IDs, processed by the encoder to produce a contextualized [CLS] vector $\mathbf{h}_{[CLS]} \in \mathbb{R}^{d}$ ($d = 768$ for both BERT Base and XLNet Base). A single fully connected head maps the [CLS] representation to $L$ logits using a weight matrix $W \in \mathbb{R}^{L \times d}$, followed by a sigmoid activation to produce independent probabilities per label:

$$\hat{\mathbf{y}} = \sigma\left(W \mathbf{h}_{[CLS]} + \mathbf{b}\right)$$

This multi-label head enables simultaneous binary predictions across all $L$ labels.
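A minimal NumPy sketch of such a sigmoid multi-label head, with illustrative dimensions ($d = 768$, $L = 80$) and random weights standing in for a trained classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 768, 80                            # hidden size and label count (illustrative)

W = rng.normal(scale=0.02, size=(L, d))   # fully connected classification weights
b = np.zeros(L)                           # bias
h_cls = rng.normal(size=d)                # contextualized [CLS] vector from the encoder

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(W @ h_cls + b)            # one independent probability per label
predicted = np.flatnonzero(probs > 0.5)   # all labels above a 0.5 threshold

print(probs.shape, predicted.shape)
```

Because each label gets its own sigmoid, any subset of labels can be predicted simultaneously, unlike a softmax head that forces a single choice.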
InstructNet (Aurpa et al., 20 Dec 2025) leverages XLNet's permutation-based language modeling and two-stream attention to better capture long-range dependencies and bidirectional context in procedural text compared to BERT's masked LM objective. Architecture parameters for the Base configuration include $12$ transformer blocks, $12$ attention heads, and hidden size $d = 768$.
2.2 Label-Enhanced Feedforward Models
The “labels as hidden nodes” framework (Read et al., 2015) augments each feedforward hidden layer by concatenating linear projections of the true label vector into each hidden state:

$$\mathbf{h}^{(k)} = \phi\left(W^{(k)}\left[\mathbf{h}^{(k-1)};\, V^{(k)}\mathbf{y}\right]\right)$$

where $V^{(k)}$ linearly projects the label vector $\mathbf{y}$ and $\phi$ is a non-linearity (e.g., ReLU). The output is a sigmoid-mapped vector approximating $p(\mathbf{y} \mid \mathbf{x})$. This approach implicitly models co-occurrence and dependency structures across labels without explicit dependency modeling or classifier chaining, and scales efficiently with the number of labels $L$.
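Assuming a single hidden layer, a ReLU non-linearity, and illustrative shapes, the label-feedback forward pass can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_proj, d_h, L = 32, 8, 16, 8                    # illustrative dimensions

V1 = rng.normal(scale=0.1, size=(d_proj, L))           # label-feedback projection
W1 = rng.normal(scale=0.1, size=(d_h, d_in + d_proj))  # hidden layer over [x; V1 y]
W_out = rng.normal(scale=0.1, size=(L, d_h))           # sigmoid output layer

def forward(x, y_feedback):
    # Concatenate the input with a linear projection of the label vector.
    z = np.concatenate([x, V1 @ y_feedback])
    h = np.maximum(0.0, W1 @ z)                        # ReLU hidden state
    return 1.0 / (1.0 + np.exp(-(W_out @ h)))          # per-label sigmoid

x = rng.normal(size=d_in)
y_true = rng.integers(0, 2, size=L).astype(float)
p_train = forward(x, y_true)        # training: true labels fed back
p_test = forward(x, np.zeros(L))    # inference: labels unknown, zeroed out
print(p_train.shape, p_test.shape)
```

At inference the true labels are unavailable, so the feedback channel is neutralized (here zeroed); the network has nonetheless absorbed label co-occurrence during training.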
2.3 Instruction-tuned LLMs
CrisisSense-LLM (Yin et al., 16 Jun 2024) demonstrates instruction fine-tuning of LLaMA 2 for multi-label classification. Task formulation involves construction of instruction-oriented prompts that elicit multiple categorical outputs (e.g., event, informativeness, aid-type), formatted as JSON. Parameter-efficient tuning (LoRA) and full-parameter finetuning are supported. Multi-label prediction is operationalized as a conditional language modeling task over concatenated instructions and expected JSON outputs.
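A sketch of this prompt/JSON formulation; the template wording and the field names (`event`, `informativeness`, `aid_type`) are illustrative assumptions, not the exact CrisisSense-LLM template:

```python
import json

# Hypothetical instruction-oriented prompt template eliciting three
# categorical outputs as a single JSON object.
PROMPT_TEMPLATE = (
    "Classify the following disaster-related tweet. "
    "Respond with JSON containing the keys event, informativeness, aid_type.\n"
    "Tweet: {tweet}\nAnswer:"
)

def build_prompt(tweet: str) -> str:
    return PROMPT_TEMPLATE.format(tweet=tweet)

def parse_response(generated: str) -> dict:
    # The fine-tuned model is trained to emit JSON after the prompt;
    # parsing it back recovers the multi-label prediction.
    return json.loads(generated)

prompt = build_prompt("Bridge on Route 9 collapsed, cars trapped")
fake_generation = (
    '{"event": "infrastructure_damage", '
    '"informativeness": "informative", "aid_type": "rescue"}'
)
labels = parse_response(fake_generation)
print(sorted(labels))
```

During fine-tuning, the loss is ordinary next-token cross-entropy over the concatenation of such prompts and their expected JSON completions.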
3. Training Objective and Optimization
The prevalent objective in explicit multi-label supervised setups is binary cross-entropy, averaged over all labels and instances:

$$\mathcal{L} = -\frac{1}{NL} \sum_{i=1}^{N} \sum_{j=1}^{L} \left[\, y_{ij} \log \hat{y}_{ij} + \left(1 - y_{ij}\right) \log\left(1 - \hat{y}_{ij}\right) \right]$$
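The binary cross-entropy objective can be computed directly in NumPy; the toy label matrix $Y$ and probability matrix $P$, both of shape $(N, L)$, are illustrative:

```python
import numpy as np

def multilabel_bce(Y, P, eps=1e-12):
    # Clip probabilities for numerical stability, then average the
    # per-label binary cross-entropy over all instances and labels.
    P = np.clip(P, eps, 1.0 - eps)
    return float(-np.mean(Y * np.log(P) + (1 - Y) * np.log(1 - P)))

Y = np.array([[1, 0, 1], [0, 0, 1]], dtype=float)     # ground-truth labels
P = np.array([[0.9, 0.2, 0.8], [0.1, 0.3, 0.7]], dtype=float)  # sigmoid outputs
print(round(multilabel_bce(Y, P), 4))
```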
For label-augmented neural approaches (Read et al., 2015), an additional regularization term on the weight and label-projection matrices is used to constrain model capacity.
Optimization is typically performed using AdamW with warmup and learning rate decay schedules. Default batch sizes (e.g., 48–64), sequence truncation limits (e.g., 512 tokens), and regularization are tuned per architecture and dataset (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024). In LLM-based approaches, more sophisticated schedules and checkpointing accompany parameter-efficient adaptation.
4. Performance Evaluation and Benchmarks
Evaluation follows the established metrics for multi-label tasks:
- Accuracy: the proportion of correct predictions (per-label or exact-match, depending on the benchmark)
- Micro-F1: computed globally over all predictions
- Macro-F1: averaged across all labels, considering per-label precision and recall
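A sketch of micro- versus macro-F1 on toy multi-label predictions, illustrating that micro-F1 pools true/false positives across all labels while macro-F1 averages per-label F1 scores:

```python
import numpy as np

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(Y, Yhat):
    # Per-label counts over the (N, L) ground-truth and prediction matrices.
    tp = (Y * Yhat).sum(axis=0)
    fp = ((1 - Y) * Yhat).sum(axis=0)
    fn = (Y * (1 - Yhat)).sum(axis=0)
    micro = f1(tp.sum(), fp.sum(), fn.sum())          # pooled counts
    macro = float(np.mean([f1(t, f, n) for t, f, n in zip(tp, fp, fn)]))
    return micro, macro

Y    = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]], dtype=float)
Yhat = np.array([[1, 0, 0], [1, 1, 0], [0, 1, 1]], dtype=float)
micro, macro = micro_macro_f1(Y, Yhat)
print(round(micro, 3), round(macro, 3))
```

Because macro-F1 weights every label equally, it is more sensitive than micro-F1 to poor performance on rare, long-tail labels.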
On wikiHow, InstructNet (XLNet) consistently outperforms a BERT baseline in accuracy, micro-F1, and macro-F1 (Aurpa et al., 20 Dec 2025). An ablation removing the multi-label head substantially reduces accuracy, confirming the necessity of independent multi-label outputs.
CrisisSense-LLM attains strong overall accuracy for disaster-tweet classification with full-parameter tuning, improved further by checkpoint ensembling. Task-wise, event and informativeness accuracies are highest, while aid-type, reflecting a longer-tailed label distribution, achieves lower accuracies starting at roughly $55\%$ (Yin et al., 16 Jun 2024). Zero-shot LLaMA 2-Chat performs substantially worse, underscoring the impact of task-specific instruction tuning.
Label-enhanced FFN approaches (“labels as hidden nodes”) outperform traditional binary relevance and classifier chain baselines under exact match metrics, while exhibiting linear scalability in the number of labels $L$ (Read et al., 2015).
5. Modeling Label Dependencies and Label Structure
Explicit modeling of label dependencies is historically seen as crucial in multi-label classification. However, the analysis in (Read et al., 2015) establishes that label dependency is often a consequence of base model limitations rather than inherent to the data. Feeding the label vector as input to middle layers allows the model to capture global co-occurrence patterns and reduces the need for explicitly learned dependency structures or combinatorial inference over label subsets.
InstructNet (Aurpa et al., 20 Dec 2025) does not replicate classifier-chaining, but refers to future directions involving graph-based or co-attentive label modeling, hierarchical architectures, and conditional label embeddings for more refined dependency exploitation. A plausible implication is that such techniques could further improve F1, particularly in settings with high label correlation or when extending to large, open taxonomies.
6. Current Limitations and Future Directions
Both InstructNet and CrisisSense-LLM outline open challenges:
- Label Imbalance and Sparsity: Simple frequency filtering (e.g., the 500-occurrence label cutoff) still leaves long-tail label frequencies and under-represented classes (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024).
- Label Structure Modeling: Proposals include adopting graph-based dependency encoders, hierarchical classification, and label embedding strategies (Aurpa et al., 20 Dec 2025).
- Instruction/Prompt Sensitivity in LLMs: Instruction-tuned LLMs are sensitive to prompt format; performance drops by $14\%$ or more if the prompt at inference diverges from the training template (Yin et al., 16 Jun 2024).
- Scaling to Longer Inputs: Handling full instructional articles or multi-step procedures may require sparse attention architectures such as Longformer (Aurpa et al., 20 Dec 2025).
- Threshold Calibration and Metric Extensions: Thresholding per-label, optimizing Fβ, and extending evaluation to detailed micro/macro metrics are proposed to better accommodate skewed label distributions (Aurpa et al., 20 Dec 2025, Yin et al., 16 Jun 2024).
- Catastrophic Forgetting: Particularly in PEFT, models may lose the ability to generate valid outputs after extended training epochs (Yin et al., 16 Jun 2024).
- Multi-modal Integration: For richer context (e.g., disaster informatics), future work includes the addition of image and location features to the classification pipeline (Yin et al., 16 Jun 2024).
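The per-label threshold calibration proposed above can be sketched as a simple grid search on held-out validation data; the probability values and threshold grid below are illustrative:

```python
import numpy as np

def best_threshold(y, p, grid=np.arange(1, 20) / 20):
    # Pick the threshold in the grid maximizing F1 for this single label,
    # using validation labels y and predicted probabilities p.
    def f1_at(t):
        yhat = (p >= t).astype(float)
        tp = (y * yhat).sum()
        fp = ((1 - y) * yhat).sum()
        fn = (y * (1 - yhat)).sum()
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0
    return max(grid, key=f1_at)

# Toy validation data for one label: positives cluster above ~0.55.
y = np.array([1, 1, 0, 0, 1], dtype=float)
p = np.array([0.9, 0.6, 0.4, 0.1, 0.55])
t = best_threshold(y, p)
print(t)
```

Repeating this per label replaces the uniform 0.5 cutoff with label-specific operating points, which particularly helps rare labels whose calibrated probabilities rarely exceed 0.5.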
7. Broader Applications and Generalization
Multi-label instruction classification methods are domain-agnostic. Instruction-based LLM fine-tuning, transformer-based multi-label heads, and label-augmented neural models have demonstrated transferability to domains such as medical coding, multi-aspect sentiment, and procedural text classification in knowledge bases. Parameter-efficient fine-tuning (e.g., LoRA) enables resource-conscious adaptation of large models to new multi-label taxonomies (Yin et al., 16 Jun 2024). Table-driven instruction template approaches facilitate transparent and extensible task specification.
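The low-rank mechanism behind LoRA can be illustrated in a few lines of NumPy: a frozen weight matrix $W$ is adapted through a trainable rank-$r$ update $BA$, so far fewer parameters are trained. Shapes and rank are illustrative; practical setups use libraries such as Hugging Face PEFT:

```python
import numpy as np

rng = np.random.default_rng(2)
d_out, d_in, r = 64, 64, 4                    # illustrative dimensions and rank

W = rng.normal(size=(d_out, d_in))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))    # trainable down-projection
B = np.zeros((d_out, r))                      # trainable up-projection (zero init)

def adapted_forward(x):
    # Base path plus the low-rank adapter path.
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# With B initialized to zero, the adapter starts as an exact no-op,
# so fine-tuning departs smoothly from the pretrained model.
assert np.allclose(adapted_forward(x), W @ x)
print(A.size + B.size, W.size)                # trainable vs. frozen parameter counts
```

Here only $r(d_{in} + d_{out}) = 512$ parameters are trained against the $4096$ frozen ones, which is what makes adapting large models to new multi-label taxonomies resource-conscious.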
A summary table of representative approaches and their core properties:
| Approach | Architecture | Core Mechanism |
|---|---|---|
| InstructNet (XLNet) | Transformer + FC Head | Sigmoid multi-label prediction; permutation LM |
| BERT/BERT-Base Baseline | Transformer + FC Head | Masked LM; sigmoid multi-label prediction |
| Label-as-Hidden-Nodes | FFN + Label Feedback | Labels as layer input; implicit dependency |
| CrisisSense-LLM | Instruction-tuned LLM (LLaMA) | Prompt-based multi-output conditioning |
The development of robust, efficient, and generalizable multi-label instruction classification systems remains a priority as procedural and multi-aspect textual information proliferates. Recent advances validate both transformer-based and label-augmented architectures and highlight the continuing need for innovations in modeling label structure, prompt robustness, and multi-domain generalization (Aurpa et al., 20 Dec 2025, Read et al., 2015, Yin et al., 16 Jun 2024).