LLM-Driven Machine Learning Methods
- LLM-driven machine learning methods are approaches that integrate foundation models into ML workflows, enhancing data efficiency and interpretability.
- They employ techniques such as LLM-guided priors, synthetic data generation, and natural language-based feature extraction to improve accuracy and robustness.
- LLMs enable automated pipeline design, causal discovery, and collaborative systems, addressing challenges like data scarcity, bias, and complex model calibration.
LLM-driven machine learning methods leverage the knowledge, reasoning, and natural language processing capabilities of foundation models—such as GPT, Llama, or domain-specialized variants—to augment, automate, or reinterpret traditional ML workflows. These systems span a diverse range of technical approaches, including LLM-guided model priors, data augmentation, feature engineering, pipeline automation, causal discovery, code synthesis, and collaborative human-agent frameworks. The following sections delineate key paradigms, technical advancements, practical workflows, and critical challenges evidenced by recent research.
1. LLMs as Knowledge Priors and Model Guides
LLMs are increasingly employed to inject external, domain-grounded priors into classical statistical models, especially in low-data or few-shot learning regimes. One prominent example is the integration of LLM-driven priors for tabular classification (Zhu et al., 2023):
- Ordered Categorical Encoding: LLMs can assess and order categorical variables by relevance to the target variable (e.g., ranking employment types by plausibly expected effect on income), allowing practitioners to replace one-hot encodings with ordinal mappings that reflect semantic hierarchies. This approach dramatically reduces input dimensionality and greatly improves generalization in sparse data environments.
- Correlation Priors for Continuous Features: By prompting an LLM to estimate the sign (positive, negative, or none) of the correlation between each feature and the target, these subjective judgments define priors for model coefficients, injected as a regularization term in the logistic regression loss:

$$\mathcal{L}(\beta) = \sum_{i=1}^{n} \ell\big(y_i, \mathbf{x}_i^{\top}\beta\big) + \lambda\,\lVert \beta - \beta^{\text{prior}} \rVert_2^2,$$

where $\beta^{\text{prior}}$ is the vector of LLM-suggested priors, and $\lambda$ tunes the penalty strength.
- Nonlinear Monotonic Enhancements: The MonotonicLR model extends logistic regression by integrating an Unconstrained Monotonic Neural Network (UMNN) that maps ordinal categorical labels to real-valued cardinal representations, preserving LLM-derived order while exposing non-linear relationships.
Empirical tests show these strategies yield superior accuracy and AUC compared to tree ensembles and other LLM-tabular models, particularly under severe data constraints, without sacrificing interpretability (Zhu et al., 2023).
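A minimal sketch of the prior-regularized logistic regression described above. The specific prior vector, penalty weight, and plain gradient-descent fitting loop are illustrative assumptions, not the exact formulation of Zhu et al. (2023):

```python
import numpy as np

def prior_penalized_logistic_loss(beta, X, y, beta_prior, lam):
    """Logistic loss plus an L2 pull toward LLM-suggested prior coefficients."""
    z = X @ beta
    # Numerically stable log(1 + exp(-y*z)) with labels y in {-1, +1}
    log_loss = np.mean(np.logaddexp(0.0, -y * z))
    penalty = lam * np.sum((beta - beta_prior) ** 2)
    return log_loss + penalty

def fit(X, y, beta_prior, lam=0.05, lr=0.5, steps=2000):
    """Plain gradient descent on the penalized loss (illustrative optimizer)."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        z = X @ beta
        sig = 1.0 / (1.0 + np.exp(y * z))  # derivative factor of the log-loss
        grad = -(X.T @ (y * sig)) / len(y) + 2.0 * lam * (beta - beta_prior)
        beta -= lr * grad
    return beta
```

With scarce data, the penalty keeps coefficients near the LLM-suggested signs unless the evidence strongly contradicts them.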
2. LLM-Driven Data Augmentation, Curation, and Feature Engineering
LLMs are increasingly applied as data generators and curators, especially in contexts where acquiring labeled examples is expensive or infeasible.
- LLM-Based Synthetic Data Generation: Methods such as Curated LLM (CLLM) use in-context LLM prompting to synthesize new tabular examples from a minimal real dataset (Seedat et al., 2023). To avoid model drift from spurious or unfaithful samples, a curation stage is introduced:
- Samples are evaluated by a trained classifier at multiple checkpoints $e = 1, \dots, E$, computing the average confidence and aleatoric uncertainty of each synthetic sample $x$ with assigned label $\hat{y}$:

$$\bar{P}(x) = \frac{1}{E}\sum_{e=1}^{E} p_e(\hat{y} \mid x), \qquad u_{\text{al}}(x) = \frac{1}{E}\sum_{e=1}^{E} p_e(\hat{y} \mid x)\,\big(1 - p_e(\hat{y} \mid x)\big)$$
- Only synthetic examples with confidence above a threshold and uncertainty below another threshold are retained. This strategy substantially improves model performance in the few-shot regime relative to both GAN-based and raw LLM-based generators.
- Interpretable, Compact Feature Extraction from Text: LLMs can be directly prompted to produce low-dimensional, semantically labeled features from unstructured data—such as research abstracts—with each feature corresponding to directly interpretable qualities (e.g., methodological rigor, novelty, grammar) (Balek et al., 11 Sep 2024). These interpretable LLM-derived features permit competitive classification accuracy compared to state-of-the-art high-dimensional embeddings (e.g., SciBERT) while supporting transparent downstream rule mining.
This compactness and interpretability facilitate rule extraction and counterfactual reasoning.
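The checkpoint-based curation rule can be sketched as follows, assuming predicted class probabilities from $E$ checkpoints are already available; the threshold values are illustrative, not those of Seedat et al. (2023):

```python
import numpy as np

def curate(checkpoint_probs, labels, conf_min=0.7, unc_max=0.2):
    """Keep synthetic samples the classifier is confidently and stably right about.

    checkpoint_probs: (E, N, C) array of class probabilities from E checkpoints.
    labels: (N,) integer labels assigned to the N synthetic samples.
    Returns a boolean mask over the N samples.
    """
    E, N, C = checkpoint_probs.shape
    # Probability assigned to each sample's own label at every checkpoint
    p_label = checkpoint_probs[:, np.arange(N), labels]      # shape (E, N)
    confidence = p_label.mean(axis=0)                        # average confidence
    aleatoric = (p_label * (1.0 - p_label)).mean(axis=0)     # average p(1-p)
    return (confidence >= conf_min) & (aleatoric <= unc_max)
```

Only samples passing both thresholds are added to the training set, filtering out implausible or ambiguous LLM generations.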
3. LLMs as Augmenters, Calibrators, and Pipelines in Classical ML
LLMs are effective as sophisticated “second opinions” or ensemble components, able to correct, calibrate, or guide classical estimators.
- LLM-Enhanced Combination and Calibration Methods: Approaches include linear and adaptive weighted ensembling of ML and LLM outputs (Wu et al., 8 May 2024):

$$\hat{y}(x) = (1 - w)\, f_{\text{ML}}(x) + w\, f_{\text{LLM}}(x),$$

where $f_{\text{LLM}}(x)$ is the LLM output and the weight $w$ is optimized to minimize prediction loss on held-out data.
- Conditional Calibration: LLM outputs serve as covariate groups to drive a joint calibration, enforcing “multi-accuracy” by correcting systematic error for slices determined by both ML outputs and LLM predictions.
- Transfer Learning with LLM-Synthesized Labels: To address distributional shift, additional samples from a target distribution are labeled with the LLM and incorporated into the training loss:

$$\mathcal{L}(\theta) = \sum_{(x, y) \in \mathcal{D}_{\text{source}}} \ell\big(f_\theta(x), y\big) + \gamma \sum_{x \in \mathcal{D}_{\text{target}}} \ell\big(f_\theta(x), \hat{y}_{\text{LLM}}(x)\big),$$

where $\hat{y}_{\text{LLM}}(x)$ denotes LLM-synthesized labels and $\gamma$ weights the target-distribution term.
Experiments across several datasets confirm these approaches yield robust improvements, especially in handling covariate shifts and borderline prediction cases (Wu et al., 8 May 2024).
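The linear combination scheme reduces to a one-parameter fit; a minimal sketch, assuming predicted probabilities from both models on a labeled validation set and a simple grid search over the mixing weight:

```python
import numpy as np

def combine(ml_pred, llm_pred, w):
    """Linear combination of ML and LLM predicted probabilities."""
    return (1.0 - w) * ml_pred + w * llm_pred

def fit_weight(ml_pred, llm_pred, y, grid=np.linspace(0.0, 1.0, 101)):
    """Pick the mixing weight minimizing squared error on a validation set."""
    losses = [np.mean((combine(ml_pred, llm_pred, w) - y) ** 2) for w in grid]
    return grid[int(np.argmin(losses))]
```

The adaptive variants in the paper go further, letting the weight depend on the input; this fixed-weight version shows only the basic mechanism.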
4. LLMs for Automated and Agentic Machine Learning Engineering
LLMs can be integrated as code generators, pipeline planners, and autonomous agents that orchestrate the end-to-end ML workflow.
- AutoML with LLM-Driven Agents and Tree Search: SELA employs LLMs as experimenters—proposing and executing components for each pipeline stage—and organizes the solution space as a search tree. Monte Carlo Tree Search (MCTS) efficiently explores pipeline configurations, selecting nodes to expand via a UCT-style criterion:

$$UCT(s, a) = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}$$
This method attains high win rates (65–80%) over standard AutoML and LLM-only baselines on diverse datasets, highlighting the synergy of LLM creativity with systematic search (Chi et al., 22 Oct 2024).
- Learning-based RL Agents for ML Engineering: ML-Agent utilizes online reinforcement learning to fine-tune an LLM agent on realistic ML tasks. The agent learns to generate diverse actions (code edits, hyperparameter changes), leveraging a step-wise RL paradigm and a specialized reward module that unifies divergent ML feedback. Key innovations include exploration-enhanced fine-tuning for diversity, step-wise RL for training efficiency, and reward normalization for robust optimization. Despite being trained on only nine tasks, the 7B model outperforms a competing 671B parameter agent, demonstrating pronounced generalization (Liu et al., 29 May 2025).
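A toy sketch of the UCT selection rule that MCTS uses to balance exploring untried pipeline components against exploiting high-scoring ones; the branch names and statistics are illustrative, and SELA's actual implementation differs in detail:

```python
import math

def uct_score(total_value, visits, parent_visits, c=1.4):
    """Upper-confidence score for one branch of the pipeline search tree."""
    if visits == 0:
        return float("inf")  # always try unexplored components first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)

def select(children, parent_visits):
    """children: dict mapping branch name -> (total_value, visits).

    Returns the branch with the highest UCT score.
    """
    return max(children, key=lambda k: uct_score(*children[k], parent_visits))
```

Each simulated rollout here would correspond to building and evaluating one full pipeline, with the validation score backed up along the tree.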
5. LLMs for Human-Centric, Collaborative, and Interpretable ML Systems
LLM-driven systems lower barriers to entry for non-experts, mediate collaborative workflows, and improve transparency and user control.
- Human–LLM Collaborative Formulation and Interactive ML: DuetML pairs reactive (user-queried) and proactive (system-initiated) multimodal LLM agents with end-users (Kawabe et al., 28 Nov 2024). The interface supports guidance for dataset construction, label refinement, and model evaluation. Empirical evidence suggests that this paradigm helps non-experts produce more meaningful training data and formulate high-quality ML tasks without increasing cognitive load.
- Natural Language Interfaces for AutoML and Model Assembly: Large-scale studies confirm that LLM-powered natural language interfaces can increase success rates and implementation accuracy while reducing development and learning time (e.g., a reported 50% reduction in development time and 73% faster error correction in a recent organizational study) (Yao et al., 8 Jul 2025). The system translates user instructions into modular stages—including feature engineering, model selection, and HPO—thus democratizing ML deployment across technical backgrounds.
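The instruction-to-stages translation can be sketched as validating a structured plan returned by the LLM. The JSON schema, stage names, and hard-coded reply below are hypothetical illustrations, not the interface of Yao et al.; in a real system the reply would come from a model call:

```python
import json

# Hypothetical LLM reply to "train a churn classifier on customers.csv"
llm_reply = """
{
  "feature_engineering": ["impute_missing", "one_hot_encode"],
  "model_selection": ["logistic_regression", "gradient_boosting"],
  "hpo": {"method": "random_search", "budget": 50}
}
"""

def parse_plan(reply: str) -> dict:
    """Validate the LLM's plan against the stages the runner can execute."""
    plan = json.loads(reply)
    allowed = {"feature_engineering", "model_selection", "hpo"}
    unknown = set(plan) - allowed
    if unknown:
        raise ValueError(f"unsupported stages: {sorted(unknown)}")
    return plan
```

Constraining the LLM to a fixed stage vocabulary is what keeps the downstream execution deterministic even though the upstream instruction is free-form.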
6. LLMs for Causal Discovery, Explanation Evaluation, and Scientific Reasoning
LLMs are being adapted for upstream scientific workflows, enabling new modalities of reasoning, fairness analysis, and knowledge discovery.
- Causal Discovery with Active Learning and Dynamic Scoring: Fairness-driven frameworks use LLMs to infer causal relationships from metadata and variable descriptions using an active breadth-first query strategy with dynamic scoring (Zanna et al., 21 Mar 2025). This strategy substantially reduces query complexity relative to exhaustive pairwise querying in sparse settings and systematically identifies direct, indirect, and spurious dependence paths, enhancing fairness analysis in ML systems.
- LLMs as Judges for Explanation Evaluation: Transformer LLMs can rate and assess the quality of ML explanations (LIME, exemplars) along subjective dimensions like understandability and satisfaction (Wang et al., 28 Feb 2025). While current LLM judges align with human ratings for subjective quality, they do not yet match human performance in objective decision-support settings.
- Domain-Expert Reasoning, Literature Mining, and Prompt-Guided Inference: In chemistry and medicine, LLM agents are tightly coupled with curated literature bases or structured knowledge, orchestrating spectral analysis workflows or conducting debate-driven Mendelian diagnosis (Xie et al., 29 Jul 2025, Zhou et al., 10 Apr 2025). Closed-loop multi-turn prompting and agent-based reasoning allow these systems to outperform or match traditional ML models, especially in low-data settings.
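The active breadth-first query strategy can be illustrated with a simplified sketch that expands outward from the outcome variable and consults an LLM oracle only about frontier pairs, rather than all pairs; the dynamic scoring of Zanna et al. is omitted here for brevity, and `ask_llm` is a stand-in for a real prompted judgment:

```python
from collections import deque

def discover_edges(variables, outcome, ask_llm):
    """Breadth-first active causal querying starting from the outcome.

    ask_llm(cause, effect) -> True if the oracle judges a direct causal link.
    Each variable is expanded at most once, which keeps the number of
    queries small when the underlying causal graph is sparse.
    """
    edges, visited, frontier = [], {outcome}, deque([outcome])
    while frontier:
        effect = frontier.popleft()
        for cause in variables:
            if cause in visited or cause == effect:
                continue
            if ask_llm(cause, effect):
                edges.append((cause, effect))
                visited.add(cause)
                frontier.append(cause)
    return edges
```

Chains recovered this way (e.g., a sensitive attribute influencing the outcome only through a mediator) are what distinguish direct from indirect dependence in the fairness analysis.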
7. Limitations, Challenges, and Future Directions
Notwithstanding these advances, several technical challenges are evident:
- Serialization Sensitivity and Bias: LLMs may exhibit format sensitivity and encode societal biases (Zhu et al., 2023). Separating the injection of world knowledge from end-to-end black-box modeling (e.g., as priors or through interpretable feature engineering) helps mitigate these risks.
- Data Hallucination and Curation: LLM-based generators may produce implausible, noisy, or biased synthetic data. Curation via confidence/uncertainty thresholds or downstream filtering is essential (Seedat et al., 2023).
- Limited Objective Reasoning: LLM judges are not yet complete replacements for human domain experts in evaluating explanation effectiveness or providing domain-specific scientific judgments (Wang et al., 28 Feb 2025, Jia et al., 6 Aug 2025).
- Scaling and Domain Adaptation: While LLM methods scale across model sizes and datasets, reliance solely on intrinsic knowledge often limits performance in scientific or quantitatively specialized tasks. External retrieval, continual fine-tuning, or human-in-the-loop corrections remain necessary for high-stakes decision-making (Jia et al., 6 Aug 2025).
In summary, LLM-driven machine learning methods represent a paradigm shift wherein the LLM’s reasoning, knowledge, and interface capabilities are embedded at multiple stages of the ML workflow. From setting interpretable priors and generating features, to pipeline automation and causal inference, these approaches have demonstrated strong empirical gains—particularly in data-scarce or expertise-limited scenarios. Ongoing research seeks to further systematize prompt engineering, scale retrieval and curation, formalize collaboration with humans, and robustly address the limitations associated with bias, knowledge boundaries, and explainability.