ClassifyLM: LLM-Based Classification
- ClassifyLM is a framework that leverages large language models to perform binary, multiclass, and hierarchical classification across text, tabular, and multimodal data.
- It employs diverse prompt strategies, including zero-shot, few-shot, and chain-of-thought, to enhance performance and calibrate predictive outputs.
- The approach balances accuracy and efficiency through tailored model selection, ensemble methods, and rigorous evaluation under data perturbations.
ClassifyLM is a collective term for frameworks, pipelines, and model recipes that employ LLMs for classification tasks spanning text, tabular, and sometimes multimodal data. In contemporary research, ClassifyLM encompasses zero-shot, few-shot, prompt-based, and fine-tuned approaches, leveraging the generalization, reasoning, and generative capacities of LLMs for both binary and multiclass classification, with a strong emphasis on comparative performance, prompt sensitivity, latency trade-offs, and robustness. This article surveys the foundational principles, technical paradigms, empirical findings, and practical implications of ClassifyLM, emphasizing rigorous evaluation protocols and domain-specific insights.
1. Task Definitions, Problem Formalization, and Model Classes
ClassifyLM covers a broad taxonomy of supervised and weakly supervised classification tasks. Representative paradigms outlined in (Kostina et al., 14 Jan 2025) include:
- Multiclass Text Classification: E.g., prediction of “working remotely,” “not working remotely,” or “not mentioned” from online employee reviews (balanced across three classes, 1,000 samples).
- Binary Classification: E.g., fake-news detection (“fake” vs. “real”, 214 news articles).
- Hierarchical Classification: E.g., legal documents first bifurcated into “Oil and Gas” or “Other,” then subtyped into one of nine subcategories (Hopkins et al., 2023).
- Multi-Label Tagging: E.g., legal document tagging where a sample may be assigned multiple labels, reformulated as sequence generation to encode label dependencies and multiplicity (Johnson et al., 12 Apr 2025).
Formally, a dataset is mapped to predictions through LLMs, either by direct conditional likelihood maximization, next-token prediction, or structured generation. Models evaluated range from decoder-only LLMs (Llama3, GPT-4 Turbo, Mixtral, Zephyr-7B, etc.), encoder-based PLMs (RoBERTa, BERT), to hybrid ensembles and quantized variants.
2. Prompt Engineering and Few-Shot Settings
Prompt design critically mediates ClassifyLM performance and variance. Prompting schemes, detailed in (Kostina et al., 14 Jan 2025), include:
- Zero-Shot (ZS): Only an instruction and raw input; e.g., “Classify the following text into {labels} and return only the label.”
- Few-Shot (FS): Instruction plus 3–5 input–label exemplars concatenated before the input text.
- Chain-of-Thought (CoT): “Explain your reasoning step-by-step before giving the label.”
- Role-Playing (RP) and Emotional Prompting (EP): Manipulation of assistant persona or “mood” within instruction.
- Assistant Naming (NA): Prefixes such as “Alice:” or “GPT:” to calibrate response style.
Combined approaches (e.g., FS+CoT+RP+NA) yield the best peak scores but often with increased response variance, especially for mid-range models, while high-capacity LLMs such as Llama3-70B remain relatively stable (<5 F1 points of fluctuation across prompts).
Prompt sensitivity is particularly acute for low-parameter or quantized models (variance up to ±20 F1) versus high-end LLMs (variance <5 points). Prompting also allows for domain adaptation in hierarchical or multi-label contexts, by explicitly embedding decision structure or label taxonomies (You et al., 22 Aug 2025, Johnson et al., 12 Apr 2025).
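The prompting schemes above can be sketched as simple template builders. The instruction wording follows the examples in this section; the exemplar format is an illustrative assumption:

```python
def zero_shot(text: str, labels: list[str]) -> str:
    """ZS: instruction plus raw input only."""
    return (f"Classify the following text into {labels} "
            f"and return only the label.\nText: {text}")

def few_shot(text: str, labels: list[str], exemplars: list[tuple[str, str]]) -> str:
    """FS: input-label exemplars concatenated before the input text."""
    shots = "\n".join(f"Text: {x}\nLabel: {y}" for x, y in exemplars)
    return f"{shots}\n{zero_shot(text, labels)}"

def chain_of_thought(text: str, labels: list[str]) -> str:
    """CoT: request step-by-step reasoning before the label."""
    return (f"{zero_shot(text, labels)}\n"
            "Explain your reasoning step-by-step before giving the label.")
```

RP, EP, and NA variants compose the same way, prepending a persona, mood, or assistant-name prefix to the instruction.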
3. Performance, Efficiency, and Model Selection
Performance metrics in ClassifyLM research are standardized around weighted F1-score for multiclass tasks (see (Kostina et al., 14 Jan 2025), Table 1), macro/micro-F1 for multi-label scenarios (Johnson et al., 12 Apr 2025), and accuracy or area under the ROC curve for binary and detection problems. Latency—total inference time per dataset—is also explicitly reported. Empirical comparisons reveal:
| Model | Task | F1 (%) | Total time (s) |
|---|---|---|---|
| Llama3-70B | FakeNewsNet | 94.4 | 1500 |
| gpt-4-turbo | Employee Reviews | 87.6 | 2600 |
| RoBERTa | FakeNewsNet | 93.0 | 4 |
| SVM (TF-IDF) | FakeNewsNet | 88.8 | 0.4 |
- Highest F1 is achieved by LLMs (Llama3-70B, GPT-4 Turbo) with advanced prompting at the expense of substantially longer inference times (orders of magnitude slower than classical ML or mid-size PLMs).
- Classical estimators (SVM, NB), RoBERTa, and quantized LLMs offer Pareto-optimal performance-to-latency trade-offs, especially valuable when resources or latency are constrained.
Prompt-dependent gains can deliver +3–7 F1 points but require model-by-model validation. Notably, quantized 7B–13B LLMs using efficient weight quantization (AWQ/GPTQ) approximate their full-precision counterparts’ F1, providing strong results on limited hardware.
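The performance-to-latency trade-off can be made concrete by computing the Pareto front over the (F1, time) points in the table above, pooling the two tasks purely for illustration:

```python
# Pareto-front sketch over (F1, latency): a model is dominated if some
# other model is at least as accurate AND at least as fast, and strictly
# better on one axis. Numbers are taken from the table above.
models = {
    "Llama3-70B": (94.4, 1500.0),
    "gpt-4-turbo": (87.6, 2600.0),
    "RoBERTa": (93.0, 4.0),
    "SVM (TF-IDF)": (88.8, 0.4),
}

def pareto_front(points: dict[str, tuple[float, float]]) -> list[str]:
    front = []
    for name, (f1, t) in points.items():
        dominated = any(
            (f2 >= f1 and t2 <= t) and (f2 > f1 or t2 < t)
            for other, (f2, t2) in points.items() if other != name
        )
        if not dominated:
            front.append(name)
    return front
```

On these numbers, Llama3-70B, RoBERTa, and the TF-IDF SVM each survive as Pareto-optimal at different points of the accuracy–latency spectrum, while gpt-4-turbo is dominated (slower and less accurate than Llama3-70B on this pooled view).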
4. Robustness and Data Perturbation Sensitivity
ClassifyLM frameworks demonstrate notably robust performance under data perturbations. For example, under token-level “amputation” (random dropout) and “abbreviation” attacks on product descriptions, few-shot LLMs sustain macro-F1 drops of only ≈8.2%, whereas supervised transformers see 44–52% losses (Gholamian et al., 2024). Key drivers of LLM robustness in such regimes include:
- World knowledge completion: The model fills in omitted attributes using generalization beyond training distribution.
- In-context anchoring: Few-shot exemplars provide resilient semantic grounding.
- Less brittle decoding: Aggregating log-probabilities over label tokens is less sensitive to canonicalization errors than the fixed tokenized features of discriminative classifiers.
Prompt tailoring (e.g., “Combined-Reason” prompts that warn of missing attributes) yields additional F1 gains. This robustness is significant for compliance-critical or adversarial domains (e.g., trade compliance, e-commerce product categorization).
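A minimal sketch of the token-level “amputation” perturbation, assuming whitespace tokenization and an illustrative drop rate (neither is specified here by the cited study):

```python
import random

def amputate(text: str, drop_rate: float = 0.3, seed: int = 0) -> str:
    """Randomly drop a fraction of whitespace-separated tokens.

    Seeded for reproducibility; drop_rate and tokenization are
    illustrative assumptions, not the cited study's exact protocol.
    """
    rng = random.Random(seed)
    kept = [tok for tok in text.split() if rng.random() >= drop_rate]
    return " ".join(kept)
```

Applying such perturbations to held-out inputs and re-measuring macro-F1 is the evaluation pattern behind the robustness figures quoted above.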
5. Structured Output, Calibration, and Uncertainty Quantification
Several recent ClassifyLM extensions target improved calibration and sample efficiency:
- Single-Turn Structured Prediction: SALSA (Single-pass Autoregressive LLM Structured Classification) maps class labels to dedicated output tokens and constrains autoregressive decoding to a pre-specified token subset. Logit projection is performed over the class-set, yielding accurate one-pass classification and significant speedups relative to chain-of-thought or per-class querying (Berdichevsky et al., 26 Oct 2025). Performance on GLUE and domain benchmarks is competitive with full fine-tuned encoders.
- Ensembles for Uncertainty: Ensemble-based approaches operate by generating n lexical variants per intent, running greedy inference, and aggregating votes. The normalized vote fraction U is mapped through empirical CDFs to produce calibrated confidence estimates on predictive correctness (Rajamohan et al., 12 Feb 2025).
- Hybrid and Adaptive Calibration: Quantitative improvements are seen by combining LLM probability outputs with classical estimator predictions via linear ensembles, adaptive weighting, multi-calibration, and transfer learning leveraging the LLM as a source of auxiliary labeled data (Wu et al., 2024).
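The single-pass constrained-decoding idea can be sketched as masking the vocabulary logits down to the class-token subset before the softmax; the token ids and class names below are illustrative, not SALSA's actual implementation:

```python
import numpy as np

def constrained_class_probs(logits: np.ndarray, class_token_ids: list[int]) -> np.ndarray:
    """Project vocabulary logits onto the class-token subset and normalize."""
    class_logits = logits[class_token_ids]
    exp = np.exp(class_logits - class_logits.max())  # numerically stable softmax
    return exp / exp.sum()

def classify_one_pass(logits: np.ndarray, class_token_ids: list[int],
                      class_names: list[str]) -> tuple[str, np.ndarray]:
    # One forward pass yields `logits`; decoding is constrained to the
    # pre-specified class tokens, so no multi-token generation is needed.
    probs = constrained_class_probs(logits, class_token_ids)
    return class_names[int(np.argmax(probs))], probs
```

Because classification reduces to one forward pass plus an argmax over a small subset, the speedup over chain-of-thought or per-class querying follows directly.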
These techniques provide foundations for reliable deployment in high-stakes and distribution-shifted environments.
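The ensemble-vote calibration step can be sketched as majority voting over the predictions for n lexical variants, followed by an empirical-CDF mapping of the normalized vote fraction U; the held-out U samples here are illustrative:

```python
import numpy as np

def vote_confidence(predictions: list[str], held_out_U: np.ndarray) -> tuple[str, float]:
    """Majority label plus a calibrated confidence from the vote fraction.

    `held_out_U` is assumed to be vote fractions collected on a held-out
    set, so the empirical CDF maps U onto a calibrated confidence score.
    """
    labels, counts = np.unique(predictions, return_counts=True)
    top = int(np.argmax(counts))
    U = counts[top] / len(predictions)            # normalized vote fraction
    calibrated = float((held_out_U <= U).mean())  # empirical CDF of U
    return labels[top], calibrated
```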
6. Application-Specific Insights and Best Practices
- High-stakes, Multiclass Tasks: When performance is the primary constraint and latency is secondary, large LLMs with elaborate prompt engineering are preferred (e.g., Llama3-70B for news or review classification; fine-tuned GPT-3.5 for complex legal taxonomy).
- Efficiency-Driven Pipelines: For binary or simpler multiclass tasks—fake-news filtering, sentiment analysis, and others—mid-size PLMs or classical ML approaches retain performance advantages within strict latency or resource ceilings.
- Prompt Validation and Model Selection: Each combination of model architecture, quantization regime, and prompt schema requires bespoke validation. There is no universally optimal prompt: CoT, NA, RP, and few-shot strategies can interact to either stabilize or destabilize F1, especially as model parameter counts decrease.
- Human-in-the-Loop and Continuous Adaptation: Practical frameworks (see (You et al., 22 Aug 2025)) emphasize iterative prompt refinement, class-overlap analysis, and ongoing human review of classifier drift, bias, and class balance over time.
- Task Transfer and Alignment: In curriculum analysis, e.g. cybersecurity course mapping, ClassifyLM (BERT fine-tuned, (Nijdam et al., 8 Jan 2026)) is used to assign topics to nine standardized Knowledge Areas with macro-F1 = 0.64, validated against expert consensus.
7. Conclusions and Recommendations
ClassifyLM frameworks enable a spectrum of classification use cases by exploiting LLM pretraining, prompt plasticity, and robust inference protocols. Selection among large LLMs with advanced prompts, fine-tuned PLMs, and efficient classical models should be determined directly by the task’s precision–latency regime and resource constraints. Prompt engineering can yield substantial improvements but must be carefully optimized and validated for each deployment scenario. For complex or evolving domains, iterative human-in-the-loop and bias-mitigation workflows remain indispensable. For high-throughput or cost-sensitive applications, quantized open-source LLMs and classical baselines retain importance. Finally, ensemble and single-pass structured approaches extend the reach of ClassifyLM to calibrated, interpretable, production-grade classification systems. For a comprehensive review and benchmarking across tasks and prompt regimes, see (Kostina et al., 14 Jan 2025).