LLMs as Predictors: Methods & Applications

Updated 11 September 2025
  • LLMs-as-Predictors are large language models repurposed to predict outcomes by translating various prediction tasks into text prompts.
  • They leverage zero-shot, few-shot, and chain-of-thought prompting strategies to address challenges in graphs, time series, tabular data, and mobility forecasting.
  • Empirical studies reveal competitive performance and enhanced interpretability, yet highlight sensitivity to prompt design and high computational costs.

LLMs as predictors refer to the direct use of pretrained LLMs—models originally trained for language understanding and generation—as engines for making predictions in supervised, semi-supervised, or unsupervised tasks across domains such as graphs, time series, tabular data, mobility, and even social or behavioral science. Distinct from traditional architectures purpose-built for specific data types (e.g., graph neural networks for graphs, or LSTMs for time series), LLMs-as-predictors rely on prompt-driven reformulation of prediction tasks into text-based problems, leveraging the models' world knowledge, semantic reasoning, and flexible input handling. This paradigm encompasses zero-shot, few-shot, and prompt-tuned approaches, as well as frameworks that combine LLM predictive outputs with conventional machine learning models or pipelines.

1. Key Principles and Pipeline Design

The foundational principle behind LLMs-as-predictors is the “translation” of prediction problems into natural language or structured prompt formats interpretable by the LLM. For node classification on graphs, for example, each node is encoded as a prompt using its textual attributes, optionally with neighborhood context rendered as text (e.g., a summary of 2-hop neighbors), and the LLM is tasked with directly generating the node label (Chen et al., 2023). Similar pipelines are constructed for time series (where sequences are decomposed and represented as text) (Madarasingha et al., 3 Jun 2025), mobility forecasting (listing historical “stay” events in textual form) (Wang et al., 2023, Beneduce et al., 31 May 2024), and tabular data (prompting with structured rows in a CSV/JSON or Markdown-like format) (Pavlidis et al., 24 Aug 2025, Liu et al., 27 Aug 2025).
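
To make this concrete, the sketch below serializes a graph node and a truncated neighbor summary into a classification prompt. The `Node` structure, field wording, and truncation policy are illustrative assumptions, not the exact templates of the cited work.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    text: str  # the node's own textual attributes (e.g., paper title and abstract)
    neighbor_texts: List[str] = field(default_factory=list)  # rendered texts of sampled neighbors

def build_node_prompt(node: Node, labels: List[str], max_neighbors: int = 5) -> str:
    """Render a node plus a summary of its neighborhood as a classification prompt."""
    neighbor_summary = "\n".join(
        f"- {t}" for t in node.neighbor_texts[:max_neighbors]  # truncate to respect context limits
    )
    return (
        "Classify the following paper into exactly one category.\n"
        f"Allowed categories: {', '.join(labels)}\n\n"
        f"Paper: {node.text}\n\n"
        f"Summaries of neighboring papers (linked by citations):\n{neighbor_summary}\n\n"
        "Answer with the category name only."
    )
```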

Prompting strategies may include (a minimal template sketch follows the list):

  • Zero-shot: the task is specified as a question involving only the target sample.
  • Few-shot or in-context learning: a small set of labeled examples (input–output pairs) is provided before the query sample.
  • Chain-of-thought prompting: the LLM is instructed to reason step by step before issuing a prediction.
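
The sketch below shows how these three strategies differ purely at the prompt-assembly level; the template wording is an illustrative assumption, not a template taken from the cited papers.

```python
from typing import List, Tuple

def zero_shot(query: str) -> str:
    # The task is posed directly, with no labeled examples.
    return f"{query}\nAnswer:"

def few_shot(examples: List[Tuple[str, str]], query: str) -> str:
    # Labeled input-output pairs precede the query (in-context learning).
    demos = "\n\n".join(f"{x}\nAnswer: {y}" for x, y in examples)
    return f"{demos}\n\n{query}\nAnswer:"

def chain_of_thought(query: str) -> str:
    # The model is asked to reason step by step before committing to an answer.
    return f"{query}\nLet's think step by step, then state the final answer on its own line."
```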

The output is typically unstructured text and requires postprocessing—e.g., parsing text to label names via string matching or edit distance—as LLMs do not natively constrain outputs to a discrete label set (Chen et al., 2023).
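
A minimal sketch of this postprocessing step, assuming one plausible matching policy (exact substring match first, minimum edit distance as a fallback):

```python
from typing import List

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def parse_label(raw_output: str, labels: List[str]) -> str:
    """Map free-form LLM output onto the closest allowed label."""
    answer = raw_output.strip().lower()
    for lab in labels:  # substring match first
        if lab.lower() in answer:
            return lab
    return min(labels, key=lambda lab: edit_distance(answer, lab.lower()))
```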

2. Applications Across Domains

LLMs-as-predictors have been studied in diverse settings:

| Domain | Task Type | Data/Pipeline Example |
|---|---|---|
| Graphs | Node classification | Node text + neighbor summary → LLM → label (Chen et al., 2023) |
| Mobility | Next-location prediction | Historical/context stays → prompt → ranked locations (Wang et al., 2023; Beneduce et al., 31 May 2024) |
| Social science | Social feature prediction | Profile feature prompt → LLM → individual attribute (Yang et al., 20 Feb 2024) |
| Tabular | Classification/regression | Tabular rows as text → LLM (ICL) → output (Pavlidis et al., 24 Aug 2025; Liu et al., 27 Aug 2025) |
| Time series | Zero-shot forecasting | Decomposed sequence → LLM → forecast (numeric parsed) (Madarasingha et al., 3 Jun 2025) |
| Event KGs | Object/multi-event forecasting | Event quadruple/quintuple prompts → LLM (Zhang et al., 15 Jun 2024) |
| Performance | NAS performance prediction | Hyperparameter/instruction prompt → LLM → metric (Jawahar et al., 2023) |
| Ensemble ML | Expert forecast weighting | Historical expert performance → prompt → ensemble prediction (Ren et al., 29 Jun 2025) |

LLMs-as-predictors are notable for two key capabilities:

  • Key-value understanding and cross-modal input reasoning (e.g., combining numerical, categorical, and textual variables such as market indices, earnings transcripts, and analyst ratings in financial forecasting) (Ni et al., 13 Aug 2024).
  • Intrinsic interpretability via chain-of-thought or explanation generation (the model can be explicitly instructed to produce explanations) (Wang et al., 2023, Beneduce et al., 31 May 2024).

3. Strengths and Empirical Performance

LLMs-as-predictors demonstrate several empirically substantiated strengths:

  • Competitive zero- and few-shot performance for classification tasks: For node classification, best-case accuracies on PubMed using LLMs reach 90.75%, outperforming shallow baselines and approaching GNN performance (Chen et al., 2023).
  • Versatility across data types: LLMs can be used without any domain-specific retraining, providing a universal baseline on small tabular datasets (classification accuracy >0.93 on several benchmarks) (Pavlidis et al., 24 Aug 2025).
  • Cross-domain transferability: LLM predictors for dynamic text-attributed graphs (using agent-based summary and reflection modules) achieve performance comparable to or exceeding fully supervised GNNs, even without dataset-specific training (Lei et al., 5 Mar 2025).
  • Interpretability and explanatory power: In mobility prediction, LLMs not only rank likely next-locations but can generate interpretable reasoning about temporal and spatial regularities (Wang et al., 2023, Beneduce et al., 31 May 2024).
  • Robustness to some classes of adversarial attacks: LLMs-as-predictors, especially when incorporating neighbor summarization/fine-tuning, show lower accuracy degradation than MLPs and even some GNNs under textual/structural graph attacks (Guo et al., 16 Jul 2024).
  • Data augmentation and ensemble support: LLMs can supply “soft labels” in transfer learning, helping ML models adapt to covariate shift and improve out-of-distribution accuracy (Wu et al., 8 May 2024).

4. Limitations and Failure Modes

Despite their versatility, LLMs-as-predictors display significant weaknesses:

  • Sensitivity to prompt design and input serialization: Tiny prompt changes (label wording, feature order, format) can cause large swings in predictive accuracy, with error deltas exceeding 80% in some tabular regression settings (Liu et al., 27 Aug 2025). Changes to variable or place names and to row order disrupt in-context learning, an effect attributed to a U-shaped attention bias toward prompt ends (Liu et al., 27 Aug 2025); a minimal sensitivity probe is sketched after this list.
  • Limited utility for regression and clustering: LLMs underperform well-tuned specialized ML models for continuous output (negative R² often observed) and fail to deliver stable results in clustering tasks (Pavlidis et al., 24 Aug 2025).
  • Dependence on latent shortcut correlations for social prediction: Without explicit shortcut features (strongly correlated demographic or affiliation indicators), LLM performance drops to near-chance in individual-level social feature prediction; they tend to default to population-level means rather than individualized inference (Yang et al., 20 Feb 2024).
  • Context length and scalability constraints: The amount of historical or structural data LLMs can meaningfully attend to is capped by the model’s context window, requiring aggressive preprocessing or context consolidation (Lei et al., 5 Mar 2025, Chen et al., 2023).
  • High computational and financial cost compared to specialized models, especially for batch or streaming inference at scale.
  • Ambiguity in evaluation: Generated outputs can appear “wrong” by ground-truth metrics while being semantically reasonable (e.g., labeling a paper “Neural Networks” instead of “Reinforcement Learning,” which might both be plausible) (Chen et al., 2023).
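
Following up on the serialization-sensitivity bullet above, here is a minimal probe: the same tabular record is re-serialized under random column orders and the spread of the resulting predictions is inspected. `query_llm` is a placeholder for whatever model call is available, and the prompt wording is an assumption.

```python
import random
from typing import Callable, Dict, List

def serialize_row(row: Dict[str, float], column_order: List[str]) -> str:
    return ", ".join(f"{c} = {row[c]}" for c in column_order)

def sensitivity_probe(row: Dict[str, float],
                      query_llm: Callable[[str], float],  # placeholder: prompt -> numeric prediction
                      n_permutations: int = 10,
                      seed: int = 0) -> List[float]:
    """Collect predictions for one record under random column orders."""
    rng = random.Random(seed)
    columns = list(row)
    preds = []
    for _ in range(n_permutations):
        rng.shuffle(columns)
        prompt = ("Predict the target value for this record:\n"
                  f"{serialize_row(row, columns)}\n"
                  "Answer with a number only.")
        preds.append(query_llm(prompt))
    return preds  # a large spread across permutations indicates serialization sensitivity
```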

5. Methodological Innovations and Ensemble Integration

Several studies propose ways to combine LLM predictions with other models for improved reliability:

  • Linear and adaptive ensembling: Weighted or piecewise-linear combinations of LLM and ML model predictions calibrated via cross-validation yield systematic, robust improvements in classification metrics (Wu et al., 8 May 2024).
  • Post-hoc calibration: LLM predictions can be used to calibrate or correct probability estimates from conventional classifiers, especially in the presence of distribution shift (Wu et al., 8 May 2024).
  • Multi-model (stacked) approaches: For rare-event prediction (e.g., VC success), LLM-powered feature engineering is used to construct and enrich feature sets, which are then fed into conventional supervised models (XGBoost, Random Forest, Linear Regression), producing continuous predictions that are thresholded to classify rare outcomes (Kumar et al., 9 Sep 2025).
  • Distribution-based output parsing: Rather than treating the LLM as a point estimator, the token probability distribution is interpreted as a predictive posterior, which is critical for election forecasting and uncertainty quantification (Bradshaw et al., 5 Nov 2024); a minimal version is sketched below.
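
The sketch below combines two of these ideas: a label distribution is read off the LLM's token probabilities rather than a single decoded string, and that distribution is blended linearly with a conventional classifier's probabilities, with the mixing weight chosen on held-out data. The `label_logprobs` interface (per-label log-probabilities from the LLM) is an assumption; real APIs expose token-level log-probabilities in different ways.

```python
import math
from typing import Dict, List, Tuple

Dist = Dict[str, float]

def llm_posterior(label_logprobs: Dist) -> Dist:
    """Normalize per-label log-probabilities into a predictive distribution (stable softmax)."""
    z = max(label_logprobs.values())
    exp = {lab: math.exp(lp - z) for lab, lp in label_logprobs.items()}
    total = sum(exp.values())
    return {lab: v / total for lab, v in exp.items()}

def blend(p_llm: Dist, p_ml: Dist, w: float) -> Dist:
    """Linear ensemble of LLM and conventional-model distributions (same label set assumed)."""
    return {lab: w * p_llm[lab] + (1 - w) * p_ml[lab] for lab in p_llm}

def choose_weight(val_triples: List[Tuple[Dist, Dist, str]], grid: int = 21) -> float:
    """Pick the mixing weight that maximizes accuracy on a held-out validation split."""
    def accuracy(w: float) -> float:
        hits = 0
        for p_llm, p_ml, y in val_triples:
            mixed = blend(p_llm, p_ml, w)
            if max(mixed, key=mixed.get) == y:
                hits += 1
        return hits / len(val_triples)
    return max((i / (grid - 1) for i in range(grid)), key=accuracy)
```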

These innovations allow for more robust, interpretable, and often more accurate predictive pipelines, though the optimal combinations and integration logic remain context- and data-dependent.

6. Interpretability, Explanation, and Transparency

A distinguishing feature of LLM-based predictors—particularly when designed to produce chain-of-thought or “reasons for prediction” explanations—is their capacity for interpretability, both in mobility forecasting (Wang et al., 2023, Beneduce et al., 31 May 2024) and in scenarios calling for feature attribution or decision traceability (Kumar et al., 9 Sep 2025). Sensitivity analysis and parameter weight inspection in multi-model frameworks further enhance transparency (Kumar et al., 9 Sep 2025). For distribution-based prediction, the output distribution itself provides insight into the model’s world knowledge and uncertainty, facilitating algorithmic fidelity checks and bias analysis (Bradshaw et al., 5 Nov 2024).

7. Outlook and Research Directions

Emerging lines of research seek to improve the reliability, scalability, and applicability of LLMs-as-predictors:

  • Architectural invariance: There is an explicit call to develop LLM architectures or training procedures that are robust to task-irrelevant variations, such as variable/row order or label serialization (Liu et al., 27 Aug 2025).
  • Hierarchical, multi-agent, and domain-adaptive approaches: Multi-agent LLM frameworks with distinct roles for global and local summarization, as well as knowledge reflection, show promise for dynamic graph prediction and cross-domain adaptability without retraining (Lei et al., 5 Mar 2025).
  • Integration with retrieval and domain-specific modules: LLMs may be combined with retrieval-augmented tools to overcome context-length constraints and gaps in domain coverage (Zhang et al., 15 Jun 2024).
  • Extended explanation and self-assessment: Advanced prompting for explicit uncertainty quantification and error analysis is being developed, with distribution-based methods enabling rich model introspection (Bradshaw et al., 5 Nov 2024).
  • Application to rare-event and low-data scenarios: LLMs combined with model ensembles and feature-engineering pipelines provide substantial gains in precision and actionable insight for rare-event prediction in VC and other high-stakes decision processes (Kumar et al., 9 Sep 2025).

Overall, while LLMs-as-predictors supply a flexible and interpretable predictive interface across various data modalities, their practical deployment demands careful prompt engineering, robustness analysis, and often auxiliary integration with model ensembles or domain-specific processing. Robustness to task-irrelevant perturbations and calibration under out-of-distribution settings remain core challenges for future research.
