LLMs as Estimable Classifiers

Updated 26 November 2025
  • Large language models as estimable classifiers are advanced generative models that output class probabilities by jointly modeling inputs and labels using prompt engineering and instruction tuning.
  • Their training regimes combine dataset aggregation, soft target calibration, and zero-shot or in-context strategies to achieve data-efficient and scalable classification across diverse domains.
  • Empirical results show these LLM-based classifiers achieve accuracy and calibration competitive with traditional discriminative models on text, tabular, and time-series tasks.

LLMs as estimable classifiers represent a paradigm in which generative, instruction-tuned, and sometimes zero-shot or in-context LLMs are repurposed to deliver statistically meaningful predictions of class labels or probabilities across a wide spectrum of data modalities, tasks, and operating regimes. This approach reframes classification as joint modeling, harnesses prompt engineering and rich supervision strategies, and leverages the representational power of large-scale pretraining for data-efficient, scalable, and empirically robust estimation—even in settings previously dominated by task-specific discriminative models.

1. Model Formulation: Generative and Instruction-Conditioned Classification

LLMs can be “lifted” from conventional text generation into reliable, estimable classifiers by recasting the classification problem as joint modeling of inputs and labels. Rather than directly optimizing $p(y|x)$ for each prediction task, modern approaches such as UniPredict construct a joint generative model $p_\theta(y, x)$ and maximize the joint log-likelihood over observed data,

$$L(\theta) = \sum_{i=1}^{N} \log p_\theta(y_i, x_i).$$

At inference, one recovers the conditional class probabilities via Bayes' rule: $p_\theta(y|x) = \frac{p_\theta(y, x)}{\sum_{y'} p_\theta(y', x)}$. This is operationalized in LLMs by serializing feature values, metadata, and natural-language instructions into a prompt that conditions generation, so that the model outputs explicit probability strings such as: "class 0: 0.32; class 1: 0.59; class 2: 0.09." Extracting and normalizing the generated floats yields estimable class probabilities $p_\theta(y|x)$ (Wang et al., 2023).
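
As a concrete illustration of the extraction step, a minimal sketch follows that parses a generated probability string of this form and renormalizes the floats into a proper distribution; the regular expression, fallback behavior, and function name are illustrative assumptions rather than a fixed interface from the cited work.

```python
import re

def extract_class_probabilities(generated: str, num_classes: int) -> list[float]:
    """Parse strings like 'class 0: 0.32; class 1: 0.59; class 2: 0.09'
    into a normalized probability vector approximating p_theta(y|x)."""
    probs = [0.0] * num_classes
    for idx, val in re.findall(r"class\s*(\d+)\s*:\s*([0-9]*\.?[0-9]+)", generated):
        if int(idx) < num_classes:
            probs[int(idx)] = float(val)
    total = sum(probs)
    if total <= 0.0:
        # Nothing parseable was generated: fall back to a uniform distribution.
        return [1.0 / num_classes] * num_classes
    return [p / total for p in probs]  # renormalize, mirroring the Bayes-rule denominator

print(extract_class_probabilities("class 0: 0.32; class 1: 0.59; class 2: 0.09", 3))
# approximately [0.32, 0.59, 0.09] (up to floating-point rounding)
```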

Instruction tuning is pivotal: by indicating the target type or category in natural language, it is possible to dynamically encode task identity—supporting universal application across heterogeneous tabular schemas, document genres, or multi-label taxonomies without retraining per task or output column.

2. Training Regimes, Prompt Design, and Task Encoding

LLMs-as-classifiers typically undergo fine-tuning or advanced prompting strategies tailored to the expected downstream usage:

  • Dataset Aggregation and Soft Targets: Models like UniPredict are trained on large collections (e.g., 169 Kaggle tables spanning medical, financial, and behavioral domains) with careful curation to prevent dataset imbalance. Targets can be augmented by “teacher” models (e.g., isotonic-calibrated XGBoost predictors) to provide soft labels, facilitating calibrated probability estimation and uncertainty quantification (Wang et al., 2023).
  • Prompt Engineering: Data instances are encoded into prompts combining natural-language metadata, serialized feature clauses, and explicit instruction blocks. Metadata reformatting ensures compatibility across domain-specific schemata. For tasks with dynamic or hierarchical label spaces, prompts may include JSON-encoded taxonomies or label descriptions, as seen in industrial-scale scientific document classification (Tabatabaei et al., 6 Dec 2024).
  • Objective and Optimization: Training commonly minimizes token-level cross-entropy between the LLM’s output and the serialized label or probability string. For GPT-2–derived systems, optimization hyperparameters such as AdamW (learning rate $5\times 10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$) are typical, with no explicit temperature applied during training but a softmax interpretation of logits at inference (Wang et al., 2023). A prompt-and-target serialization sketch follows this list.
  • Zero-shot and In-context Strategies: Pretrained LLMs (e.g., GPT-3.5, GPT-4) can also serve as zero-shot classifiers by transforming classification into a single-step conditional generation problem, selecting a label directly through deterministic prompting (temperature $\approx 0.01$), and mapping output to the target set (Wang et al., 2023). For strategic classification, bi-level optimization can be encoded in self-attention via in-context learning with no parameter updates (Lv et al., 10 Nov 2025).
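
To make the encoding concrete, here is a minimal sketch of one plausible prompt-and-target serialization for a tabular record, assuming a simple key-value template; the field names, wording, and answer format are illustrative assumptions, not the exact UniPredict templates.

```python
def build_prompt(metadata: str, features: dict, classes: list[str]) -> str:
    """Serialize table metadata, feature clauses, and an instruction block into one prompt."""
    feature_clauses = "; ".join(f"{name} is {value}" for name, value in features.items())
    answer_format = "; ".join(f"class {i}: <prob>" for i in range(len(classes)))
    instruction = (
        f"The classes are: {', '.join(classes)}. "
        f"Predict the probability of each class, answering exactly as '{answer_format}'."
    )
    return f"Dataset: {metadata}\nFeatures: {feature_clauses}\nInstruction: {instruction}\nAnswer:"

def build_target(soft_labels: list[float]) -> str:
    """Serialize teacher-calibrated soft labels into the string the model is trained to emit."""
    return " " + "; ".join(f"class {i}: {p:.2f}" for i, p in enumerate(soft_labels))

prompt = build_prompt(
    metadata="heart-disease style table; the target column records disease presence",
    features={"age": 54, "resting blood pressure": 140, "cholesterol": 239},
    classes=["absent", "present"],
)
target = build_target([0.27, 0.73])  # soft labels, e.g. from an isotonic-calibrated XGBoost teacher
# Fine-tuning then minimizes token-level cross-entropy on `target` conditioned on `prompt`,
# e.g. with AdamW (lr 5e-5, beta1 0.9, beta2 0.999), as reported for GPT-2-derived systems.
```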

3. Estimation, Calibration, and Inference Mechanisms

During inference, prompt-encoded samples are processed to generate probability estimates for each class or multi-label subset. Key technical details include:

  • Probability Extraction: For tabular and structured tasks, explicit probability annotations are parsed from the generated text. The predicted label is $\arg\max_j \hat{p}_j$, where $\hat{p}_j$ is the extracted probability for class $j$ (Wang et al., 2023).
  • Calibration: Target augmentation with isotonic or Platt-calibrated teachers ensures well-calibrated outputs. Reliability diagrams and post-hoc temperature scaling may be applied to further tighten calibration (a generic temperature-scaling sketch follows this list). Calibration-by-design is also used in adaptive integration frameworks, such as multi-calibration over grid-partitioned $(f(x), z(x))$ pairs, where $f(x)$ is a classical ML score and $z(x)$ is the LLM-derived probability (Wu et al., 8 May 2024).
  • Uncertainty Quantification: Ensemble-based approaches quantify both parametric (model-level) and input (prompt-level) variance. Variance decomposition separates lexical (prompt-induced) and conceptual (parameter-driven) uncertainty, providing an interpretable certainty score for each prediction (Rajamohan et al., 12 Feb 2025).
  • Multi-label Prediction: Autoregressive LLMs in the multi-label regime often emit labels sequentially, with step-wise softmax distributions that are spiky and poorly aligned to true marginal label probabilities. Distribution alignment post-processing (e.g., unary or pairwise breakdown, or compare-to-none) is required to recover calibrated, interpretable $P(\ell|x)$ estimates (Ma et al., 23 May 2025).
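
To illustrate the post-hoc temperature scaling mentioned under Calibration above, the sketch below fits a single temperature on held-out validation logits by minimizing negative log-likelihood; this is the generic recipe, assuming integer-coded labels, rather than the specific procedure of any cited paper.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray,
                    grid: np.ndarray = np.linspace(0.25, 5.0, 200)) -> float:
    """Pick the temperature T minimizing validation NLL of softmax(logits / T)."""
    best_T, best_nll = 1.0, np.inf
    for T in grid:
        probs = softmax(val_logits / T)
        nll = -np.mean(np.log(probs[np.arange(len(val_labels)), val_labels] + 1e-12))
        if nll < best_nll:
            best_T, best_nll = T, nll
    return best_T

# Usage: probs_test = softmax(test_logits / fit_temperature(val_logits, val_labels))
```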

4. Comparative Performance and Empirical Properties

LLMs-as-classifiers are empirically benchmarked against established discriminative approaches (XGBoost, FT-Transformer, neural nets) and task-specialized models:

  • Universal Tabular Classification: On 169 held-out test sets, UniPredict-heavy achieves mean accuracy $\approx 0.81$, a $+2.2$ pp absolute gain ($\approx +5.4\%$ relative) over XGBoost and $+13.4\%$ over FT-Transformer. With only $10\%$ of the training data (“few-shot”), accuracy exceeds XGBoost by $+118\%$ relative (Wang et al., 2023).
  • Text Classification: Zero-shot GPT-4 achieves comparable or superior accuracy to deep neural models on tasks such as economic sentiment (ACC $71.3\%$), e-commerce categorization (ACC $90\%$), and SMS spam detection (ACC $97.33\%$) (Wang et al., 2023). However, on highly informal or noisy data (COVID-19 tweets), traditional deep models still maintain an edge.
  • Hierarchical and Large-Scale Multi-label Classification: For SSRN document HMC with thousands of dynamic labels, an LLM pipeline (dense retriever + pointwise LLM classification) achieves $94.3\%$ SME-verified accuracy, substantially higher than reranking ($70\%$) and prior SOTA ($61.5\%$), with cost savings of $>90\%$ per document (Tabatabaei et al., 6 Dec 2024).
  • Strategic Classification: GLIM (gradient-free in-context bi-level optimization) with GPT-4o exceeds baseline performance by $8$–$20$ pp on large-scale fraud and phishing datasets, matching or exceeding non-strategic accuracy without explicit fine-tuning (Lv et al., 10 Nov 2025).
  • Multivariate Time Series: In few-shot MTSC (multivariate time series classification), LLMFew (LLM + patch-wise CNN encoder + LoRA adaptation) yields $125.2\%$ and $50.2\%$ improvements in accuracy over SOTA on the Handwriting and EthanolConcentration tasks, respectively (Chen et al., 30 Jan 2025).

5. Modes of Operation and Estimability Principles

LLMs as estimable classifiers realize several architectures and training modalities, each with distinct statistical or computational properties:

  • Feature-based Probing and Shallow Heads: In-context probing attaches a linear probe to the frozen LLM encoder, extracting contextualized token features and yielding robust, sample-efficient classifiers. This method exhibits lower sensitivity to prompt wording and often matches or exceeds full fine-tuning in low-data regimes (Amini et al., 2023).
  • Boosted Weak Learning: LLMs can serve as $\gamma$-weak learners on tabular data, producing interpretable rule summaries or template-based classifiers. Through summary-based boosting (multiclass AdaBoost), weak LLM hypotheses are combined to deliver strong overall accuracy, often outperforming XGBoost on very small datasets (Manikandan et al., 2023).
  • Integration with Classical Estimators: Probabilistic LLM outputs $z(x)$ can be adaptively combined with classical ML scores $\hat f_n(x)$ via linear or piecewise-invariant weighting, yielding composite estimators with strictly lower MSE than either input alone (a least-squares sketch follows this list). Calibration frameworks enforce multi-accuracy criteria over discretized confidence grids (Wu et al., 8 May 2024).
  • Out-of-distribution and Transfer Settings: LLM-based pseudo-labeling enhances transfer learning under covariate shift, improving target-domain accuracy by $9\%$ relative to classical models, and enables reliable operation in domains where labeled data is unavailable or expensive (Wu et al., 8 May 2024, Fu et al., 2023).
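
As a worked illustration of the integration with classical estimators described above, the code below fits a single mixing weight $\alpha$ by least squares on a validation set, so that the composite $\alpha \hat f_n(x) + (1-\alpha) z(x)$ attains validation MSE no worse than the better of its two inputs; the closed-form weight and the clipping to $[0,1]$ are simplifying assumptions, cruder than the piecewise and multi-calibration schemes of the cited work.

```python
import numpy as np

def fit_mixing_weight(f_val: np.ndarray, z_val: np.ndarray, y_val: np.ndarray) -> float:
    """Least-squares weight alpha for the composite alpha*f + (1 - alpha)*z.

    Minimizing sum((alpha*f + (1 - alpha)*z - y)^2) over alpha gives
    alpha* = <f - z, y - z> / ||f - z||^2, here clipped to [0, 1].
    """
    d = f_val - z_val
    denom = float(np.dot(d, d))
    if denom == 0.0:
        return 0.5  # f and z agree everywhere, so any weight is equivalent
    return float(np.clip(np.dot(d, y_val - z_val) / denom, 0.0, 1.0))

# Toy usage: a noisy classical score f, a somewhat sharper LLM probability z, binary outcomes y.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200).astype(float)
f = np.clip(y + rng.normal(0.0, 0.35, size=200), 0.0, 1.0)
z = np.clip(y + rng.normal(0.0, 0.25, size=200), 0.0, 1.0)
alpha = fit_mixing_weight(f, z, y)
combo = alpha * f + (1.0 - alpha) * z
print(alpha, np.mean((f - y) ** 2), np.mean((z - y) ** 2), np.mean((combo - y) ** 2))
```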

6. Limitations, Architectural Constraints, and Open Directions

While LLM-based classifiers are highly general and empirically powerful, several technical and principled limitations have been identified:

  • Attention Mechanisms: Unidirectional, decoder-only LLMs (e.g., GPT-4, Qwen3) are systematically disadvantaged on tasks requiring access to right-context (e.g., Chinese classifier prediction) compared to bidirectional models like BERT. Even with 100B+ parameters and fine-tuning, LLMs lag by $9$–$30$ pp in accuracy on such tasks, due to inherent architectural constraints (ZiqiZhang et al., 25 Aug 2025).
  • Prompt and Context Limitations: Context-window size restricts the maximum number of features, examples, or taxonomy nodes that can be embedded in a single prompt. High-cardinality tables, long text sequences, or immense label sets may exceed feasible prompt length (Wang et al., 2023, Tabatabaei et al., 6 Dec 2024).
  • Calibration and Probability Outputs: Although fine-tuned LLMs can produce well-calibrated soft-labels, raw generation outputs are not always proper probabilities—requiring careful extraction, temperature scaling, or post-hoc calibration (Wang et al., 2023, Tabatabaei et al., 6 Dec 2024).
  • Multi-label Sequencing and Spikiness: Autoregressive LLMs in multi-label settings emit one label per step with locally spiky distributions, requiring post-hoc distribution alignment (e.g., unary or pairwise breakdowns) to approximate empirical annotator confidence distributions (Ma et al., 23 May 2025).
  • Data Dependency and Low-resource Regimes: Encoding-based classification generally surpasses generation-based modes except in extremely small datasets, where all approaches may require data augmentation or active learning to reach satisfactory accuracy (Ruan et al., 2 Oct 2024).
  • Interpretability and Strategic Manipulation: In strategic domains, transparency is preserved by the constructive attention design in in-context learning, but real LLMs' non-linearity complicates formal guarantees (Lv et al., 10 Nov 2025).

7. Empirical Economics and Measurement Error Considerations

The integration of LLMs as estimable classifiers in large-scale empirical pipelines (e.g., Economic Policy Uncertainty indices) substantially reduces measurement error, increases fidelity to human audit benchmarks, and extends classification reach to historical and multilingual corpora. The measurement error $e_t^m$ in aggregate indices decomposes into bias and variance components, with LLM classifiers yielding a $+46\%$ F1 improvement over dictionary rules and matching or surpassing out-of-sample human-assessed benchmarks (Hartley, 22 Nov 2025). The explicit estimation and calibration of classifier parameters, systematic handling of thresholds, and probabilistic aggregation are essential for reliable empirical measurement and downstream inference.
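
One standard way to write this decomposition is sketched below in generic notation, with $\widehat{Y}_t$ denoting the classifier-based index and $Y_t$ the latent target; the cited paper's exact notation and conditioning may differ.

$$e_t^m = \widehat{Y}_t - Y_t, \qquad \mathbb{E}\left[(e_t^m)^2\right] = \underbrace{\left(\mathbb{E}[\widehat{Y}_t] - Y_t\right)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\left(\widehat{Y}_t\right)}_{\text{variance}},$$

so that, under these assumptions, reducing classification error against human audits shrinks the bias term, while calibration and probabilistic aggregation over many documents tend to control the variance term.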


In sum, LLMs as estimable classifiers constitute a robust, theoretically grounded, and practically validated methodology for universal, data-efficient, and well-calibrated classification across text, tabular, time-series, and hierarchical taxonomic domains. Their flexibility arises from generative, instruction-tuned architectures; their strengths are documented in both small-sample and industrial-scale settings; and their limitations are increasingly well-understood in terms of architectural constraints and calibration challenges. Continued advances in prompt engineering, calibration schemes, and hybrid architectures are further expanding their applicability and statistical rigor (Wang et al., 2023, Wang et al., 2023, Tabatabaei et al., 6 Dec 2024, Ruan et al., 2 Oct 2024, Amini et al., 2023, Rajamohan et al., 12 Feb 2025, Hartley, 22 Nov 2025, Lv et al., 10 Nov 2025).
