DataComp-LM Classifier

Updated 24 January 2026

DataComp-LM Classifier is a modular ensemble framework that combines classical ML predictions with frozen LLM scores for robust binary classification.
It employs linear, adaptive weighting, and multi-accuracy calibration strategies to optimize performance under both in-distribution and distribution-shifted conditions.
Empirical results on benchmarks demonstrate improved accuracy and enhanced stability over standalone ML and LLM approaches, especially under covariate shift.

The DataComp-LM classifier is a modular algorithmic framework for enhancing classical supervised learning estimators on binary classification tasks by integrating predictions from pre-trained, frozen LLMs. DataComp-LM functions as an ensemble and calibration layer, systematically combining classical ML estimators with LLM inference to provide robust improvements under both in-distribution and distribution-shifted regimes. The architecture never fine-tunes the LLM, instead querying it as an oracle for scalar scores that function as soft labels or pseudo-labels (Wu et al., 2024).

1. System Architecture

Each input is a text pair, typically (query, product), which is mapped to a $d$-dimensional feature vector $x \in \mathbb{R}^d$ via embedding. The system operates in three stages:

LLM Oracle: For each input, a prompt is constructed and passed to a frozen pre-trained LLM, e.g., GPT-3.5-Turbo-Instruct, yielding a scalar score $z \in [0,1]$ interpreted as the LLM’s probability for the positive class.
Base ML Estimator: A classical ML estimator $f_n: \mathbb{R}^d \rightarrow [0,1]$ (e.g., logistic regression) is trained on features and ground-truth labels.
DataComp-LM Layer: The outputs $f_n(x)$ and $z$ are combined via one of three strategies: linear or adaptive (piecewise-constant) weighting, a calibration layer (multi-accuracy), or pseudo-label transfer for covariate-shifted test distributions.

The overall model is flexible, operating as a wrapper around arbitrary classical estimators and LLM prompts without retraining the LLM component.

2. Mathematical Formulation

Let $x \in \mathbb{R}^d$ denote the feature vector, $z = \phi_{LLM}(\text{raw input}) \in [0,1]$ denote the LLM score, and $f_n(x) \in [0,1]$ the base ML output. Three principal combination strategies are defined:

Linear Ensemble:

$\hat{y}^{Linear}(x) = \alpha \cdot f_n(x) + (1-\alpha) \cdot z$

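The ensemble weight $\alpha$ is selected via cross-validation to minimize empirical loss:

$\hat{\alpha} = \arg\min_{\alpha} \sum_{i=1}^{n} \ell\big(\alpha \cdot f_n(x_i) + (1-\alpha) \cdot z_i,\; y_i\big)$

where $\ell$ is the classification loss and $(x_i, z_i, y_i)$ are the features, LLM score, and label of the $i$-th sample.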
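A minimal sketch of how the linear ensemble weight could be chosen, assuming log loss as the empirical loss and a simple grid search on one held-out split (the source specifies cross-validation but not the loss or the search procedure). The function names `log_loss` and `fit_alpha` and the toy scores are hypothetical.

```python
import math

def log_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p in [0,1] and label y in {0,1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def fit_alpha(f_scores, z_scores, labels, grid_size=101):
    """Grid-search alpha in [0,1] minimising the mean log loss of the blend
    alpha * f_n(x) + (1 - alpha) * z. The grid is an assumption of this sketch."""
    best_alpha, best_loss = 0.0, float("inf")
    for k in range(grid_size):
        alpha = k / (grid_size - 1)
        loss = sum(
            log_loss(alpha * f + (1 - alpha) * z, y)
            for f, z, y in zip(f_scores, z_scores, labels)
        ) / len(labels)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha, best_loss

# Toy held-out data, made up for illustration: the ML scores track the labels,
# the LLM scores do not, so the search puts all weight on the ML estimator.
f_scores = [0.9, 0.2, 0.8, 0.1]
z_scores = [0.6, 0.5, 0.4, 0.7]
labels = [1, 0, 1, 0]
alpha, loss = fit_alpha(f_scores, z_scores, labels)
```

In practice the same search would be repeated over several folds and the weight with the lowest average validation loss retained.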

Adaptive-Weight (AdaLinear) Ensemble:

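The range $[0,1]$ of $z$ is partitioned into $K$ bins $I_1, \dots, I_K$, and a piecewise-constant function $\alpha(z)$ is learned:

$\hat{y}^{AdaLinear}(x) = \alpha(z) \cdot f_n(x) + (1-\alpha(z)) \cdot z, \quad \alpha(z) = \alpha_k \text{ for } z \in I_k$

Each bin weight $\alpha_k$ is optimized independently, with $\ell$ the same empirical loss used for the linear ensemble:

$\hat{\alpha}_k = \arg\min_{\alpha} \sum_{i:\, z_i \in I_k} \ell\big(\alpha \cdot f_n(x_i) + (1-\alpha) \cdot z_i,\; y_i\big)$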
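The adaptive-weight idea can be sketched as follows, under stated assumptions: the bins are equal-width intervals of the LLM score $z$, the loss is log loss, and each bin weight is found by grid search. None of these choices is confirmed by the source, and the names `bin_index`, `fit_bin_weights`, and `predict_adalinear` are hypothetical.

```python
import math

def log_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one prediction p in [0,1] and label y in {0,1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bin_index(z, n_bins):
    """Map z in [0,1] to one of n_bins equal-width bins; z == 1.0 goes to the last bin."""
    return min(int(z * n_bins), n_bins - 1)

def fit_bin_weights(f_scores, z_scores, labels, n_bins=2, grid_size=21, default=0.5):
    """Fit one blending weight per bin of z; empty bins keep `default` (an assumption)."""
    candidates = [j / (grid_size - 1) for j in range(grid_size)]
    weights = [default] * n_bins
    for k in range(n_bins):
        rows = [(f, z, y) for f, z, y in zip(f_scores, z_scores, labels)
                if bin_index(z, n_bins) == k]
        if not rows:
            continue

        def bin_loss(a, rows=rows):
            # Total log loss of the blend on the samples falling in this bin.
            return sum(log_loss(a * f + (1 - a) * z, y) for f, z, y in rows)

        weights[k] = min(candidates, key=bin_loss)
    return weights

def predict_adalinear(f, z, weights):
    """Blend f and z with the weight of the bin that contains z."""
    a = weights[bin_index(z, len(weights))]
    return a * f + (1 - a) * z

# Toy data, made up: the LLM is right when its score is low, the ML model is
# right when the LLM score is high, so the two bins get opposite weights.
f_scores = [0.6, 0.7, 0.9, 0.1]
z_scores = [0.1, 0.2, 0.6, 0.7]
labels = [0, 0, 1, 0]
weights = fit_bin_weights(f_scores, z_scores, labels)
```

With a single bin this reduces to the linear ensemble; more bins let the weight follow regions where one source is more reliable, at the cost of fewer samples per bin.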

Calibration via Multi-Accuracy:
- Naive Multicalibration: the base prediction is corrected by adding the empirical mean residual computed on the corresponding group of samples.
- Group-wise Multicalibration: a correction weight is assigned to each cell of a grid of groups, with the weights fitted via least squares.
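
One plausible reading of the mean-residual correction is sketched below. It assumes that groups are equal-width bins of the LLM score $z$ and that the correction is added to the base ML score and clipped to $[0,1]$; the grouping, the clipping, and the names `fit_residual_corrections` and `calibrated_prediction` are assumptions of this sketch, not details taken from the source.

```python
def bin_index(z, n_bins):
    """Map z in [0,1] to one of n_bins equal-width bins; z == 1.0 goes to the last bin."""
    return min(int(z * n_bins), n_bins - 1)

def fit_residual_corrections(f_scores, z_scores, labels, n_bins=2):
    """Per-bin empirical mean of the residual y - f_n(x); empty bins get 0.0."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for f, z, y in zip(f_scores, z_scores, labels):
        k = bin_index(z, n_bins)
        sums[k] += y - f
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def calibrated_prediction(f, z, corrections):
    """Shift the base score by the correction of z's bin and clip to [0,1]."""
    p = f + corrections[bin_index(z, len(corrections))]
    return min(max(p, 0.0), 1.0)

# Toy data, made up: an uninformative base model (always 0.5) and an LLM score
# that separates the classes, so the corrections carry all the signal.
f_scores = [0.5, 0.5, 0.5, 0.5]
z_scores = [0.1, 0.2, 0.8, 0.9]
labels = [0, 0, 1, 1]
corrections = fit_residual_corrections(f_scores, z_scores, labels)
```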
Transfer Learning under Covariate Shift:

When the training distribution $P$ and the target distribution $Q$ diverge, auxiliary samples $\tilde{x}_1, \dots, \tilde{x}_m$ are drawn from $Q$, and LLM pseudo-labels $\tilde{z}_1, \dots, \tilde{z}_m$ are collected. The objective is to retrain the base estimator $f$ to jointly minimize losses on original and synthetic (LLM-labeled) data:

$\min_{f} \; \sum_{i=1}^{n} \ell\big(f(x_i), y_i\big) + \sum_{j=1}^{m} \tilde{\ell}\big(f(\tilde{x}_j), \tilde{z}_j\big)$

Here, $\tilde{\ell}$ is a weak-supervision loss to discount potential LLM label noise.
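
A minimal sketch of evaluating such a mixed objective, assuming cross-entropy for both terms (with the LLM score used as a soft target) and an explicit down-weighting factor for the pseudo-labeled term. The factor `pseudo_weight`, the function names, and the toy data are hypothetical and not part of the source.

```python
import math

def cross_entropy(p, target, eps=1e-12):
    """Cross-entropy of prediction p against a target in [0,1].
    Hard labels use target in {0,1}; LLM pseudo-labels may be soft scores."""
    p = min(max(p, eps), 1.0 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1.0 - p))

def mixed_objective(model, labeled, pseudo_labeled, pseudo_weight=1.0):
    """Supervised loss on (x, y) pairs plus a weighted loss on (x, z) pairs,
    where z is an LLM pseudo-label. `pseudo_weight` is an assumed knob for
    discounting pseudo-label noise."""
    supervised = sum(cross_entropy(model(x), y) for x, y in labeled)
    weak = sum(cross_entropy(model(x), z) for x, z in pseudo_labeled)
    return supervised + pseudo_weight * weak

def constant_model(x):
    """Trivial stand-in for the base estimator: always predicts 0.5."""
    return 0.5

labeled = [(0.0, 1)]           # one source-distribution sample with a true label
pseudo_labeled = [(1.0, 0.5)]  # one target-distribution sample with an LLM score
value = mixed_objective(constant_model, labeled, pseudo_labeled)
```

Retraining would then minimize this quantity over the parameters of the base estimator, for example by gradient descent.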

3. Training Workflow

The standardized procedure for fitting DataComp-LM includes:

Feature Embedding: Precompute embeddings $x_i \in \mathbb{R}^d$ for all data, collecting corresponding ground-truth labels $y_i$.
LLM Querying: For each input, issue a prompt to the frozen LLM and record the score $z_i$.
Base Model Training: Fit $f_n$ to $\{(x_i, y_i)\}_{i=1}^{n}$.
DataComp-LM Layer Fit: Select and fit the combination/calibration strategy:
- For ensembles, weights ($\alpha$, $\alpha_k$) are fit by empirical minimization.
- For calibration, conditional residuals or group-wise multicalibration weights are computed.
- For transfer, augment with $m$ LLM-labeled samples and retrain using mixed supervision.
No LLM Fine-tuning: The LLM remains frozen throughout; parameter optimization is over the combination/calibration layer only.

4. Inference Process

At inference, fresh inputs (query–product pairs) are processed by:

Feature Extraction: Apply embedding to obtain $x$.
LLM Scoring: Prompt the frozen LLM and collect $z$.
Base Prediction: Compute $f_n(x)$.
Combination/Calibration: Use the trained DataComp-LM rule to combine $f_n(x)$ and $z$, yielding the final score $\hat{y}$.
Decision Rule: Apply a fixed threshold to $\hat{y}$ to obtain the binary output.
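
The last two inference steps can be sketched as below, assuming the base score and the LLM score have already been computed. The linear rule with weight 0.7 and the threshold of 0.5 are illustrative assumptions, and `make_linear_rule` and `predict` are hypothetical names.

```python
def make_linear_rule(alpha):
    """Return a combination rule alpha * f + (1 - alpha) * z for a fitted alpha."""
    def combine(f_score, z_score):
        return alpha * f_score + (1 - alpha) * z_score
    return combine

def predict(f_score, z_score, combine, threshold=0.5):
    """Combine the two scores, then threshold to get a binary label.
    The default threshold of 0.5 is an assumption of this sketch."""
    y_hat = combine(f_score, z_score)
    return y_hat, int(y_hat >= threshold)

rule = make_linear_rule(alpha=0.7)      # alpha would come from the fitting step
score, label = predict(0.8, 0.4, rule)  # base score 0.8, LLM score 0.4
```

Any fitted rule with the same two-argument signature (adaptive weighting, calibration) could be passed as `combine` without changing the decision step.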

The final prediction thus draws on two parallel, independently obtained sources: the trained base estimator and the frozen LLM.

5. Addressing Distributional Shift

DataComp-LM addresses covariate shift by augmenting training with pseudo-labeled data sampled from the shifted distribution. As relabeling all samples from the target distribution $Q$ is assumed infeasible, the method employs the LLM as a pseudo-labeler to generate scores $\tilde{z}$ for additional data drawn from $Q$. The training objective mixes the original supervised loss and a relaxed loss for LLM-labeled data, balancing empirical risk on the labeled data against a regularizing effect from the pseudo-labels. Empirically, the relaxed loss can be taken to be the same as the supervised loss, and the pseudo-labeled sample is chosen so that the reweighted training mix approximates the test covariate distribution (Wu et al., 2024).

6. Implementation Details and Pseudocode

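The following pseudocode summarizes the implementation, restating the steps of Sections 3 and 4:

Training:
    for each training input: compute the embedding x_i and query the frozen LLM for z_i
    fit the base estimator f_n on the pairs (x_i, y_i)
    fit the DataComp-LM layer (Linear, AdaLinear, Calibration, or Transfer) on f_n(x_i), z_i, y_i

Inference:
    for a new input: compute x, z, and f_n(x)
    combine f_n(x) and z with the fitted layer to obtain the final score
    threshold the final score to obtain the binary label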

7. Empirical Performance

DataComp-LM was validated on four public benchmarks: WANDS (Wayfair relevance), Yelp sentiment, Emotion classification, and Hate-speech detection. Across all tasks, DataComp-LM variants produced consistent accuracy gains over both baseline ML and standalone LLM oracles.

Dataset	LLM	ML	Linear	AdaLinear	Calibration
WANDS	77.5%	80.3%	83.8%	84.6%	84.0%
Yelp	72.4%	69.1%	73.9%	74.2%	74.3%
Emotion	75.9%	79.9%	80.6%	81.2%	n/a
Hate	67.2%	71.7%	72.0%	72.4%	73.1%
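
As a worked reading of the table, the snippet below computes, for each dataset, the gain in percentage points of the best DataComp-LM variant over the better of the two standalone baselines. The numbers are copied from the table; the helper name `gain_over_best_baseline` is hypothetical.

```python
# Accuracy (%) per dataset, copied from the table; None marks the n/a entry.
results = {
    "WANDS":   {"LLM": 77.5, "ML": 80.3, "Linear": 83.8, "AdaLinear": 84.6, "Calibration": 84.0},
    "Yelp":    {"LLM": 72.4, "ML": 69.1, "Linear": 73.9, "AdaLinear": 74.2, "Calibration": 74.3},
    "Emotion": {"LLM": 75.9, "ML": 79.9, "Linear": 80.6, "AdaLinear": 81.2, "Calibration": None},
    "Hate":    {"LLM": 67.2, "ML": 71.7, "Linear": 72.0, "AdaLinear": 72.4, "Calibration": 73.1},
}

def gain_over_best_baseline(row):
    """Best ensemble/calibration accuracy minus the better standalone baseline."""
    baseline = max(row["LLM"], row["ML"])
    variants = [row[k] for k in ("Linear", "AdaLinear", "Calibration") if row[k] is not None]
    return round(max(variants) - baseline, 1)

gains = {name: gain_over_best_baseline(row) for name, row in results.items()}
# gains == {"WANDS": 4.3, "Yelp": 1.9, "Emotion": 1.3, "Hate": 1.4}
```

The gains range from 1.3 to 4.3 points, and on Yelp the stronger baseline is the LLM rather than the ML estimator.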

In a constructed covariate shift (Table $\to$ Bed relevance), ML performance dropped when moving from the Table set to the Bed set, but Transfer-LLM recovered accuracy on the Bed set with only a small loss on Table. This result indicates that DataComp-LM remains robust under substantial distribution shift (Wu et al., 2024).

A plausible implication is that DataComp-LM provides a general recipe for deploying LLM-augmented classifiers in settings where retraining or relabeling is infeasible, leveraging LLMs for soft supervision, ensembling, or calibration to achieve state-of-the-art accuracy and stability.


References

Wu et al. (2024). Large Language Model Enhanced Machine Learning Estimators for Classification.
