DataComp-LM Classifier
- DataComp-LM Classifier is a modular ensemble framework that combines classical ML predictions with frozen LLM scores for robust binary classification.
- It employs linear, adaptive weighting, and multi-accuracy calibration strategies to optimize performance under both in-distribution and distribution-shifted conditions.
- Empirical results on benchmarks demonstrate improved accuracy and enhanced stability over standalone ML and LLM approaches, especially under covariate shift.
The DataComp-LM classifier is a modular algorithmic framework for enhancing classical supervised learning estimators on binary classification tasks by integrating predictions from pre-trained, frozen LLMs. DataComp-LM functions as an ensemble and calibration layer, systematically combining classical ML estimators with LLM inference to provide robust improvements under both in-distribution and distribution-shifted regimes. The architecture never fine-tunes the LLM, instead querying it as an oracle for scalar scores that function as soft labels or pseudo-labels (Wu et al., 2024).
1. System Architecture
Each input is a text pair, typically (query, product), which is mapped to a $d$-dimensional feature vector $x \in \mathbb{R}^d$ via embedding. The system operates in three stages:
- LLM Oracle: For each input, a prompt is constructed and passed to a frozen pre-trained LLM, e.g., GPT-3.5-Turbo-Instruct, yielding a scalar score $s(x) \in [0, 1]$ interpreted as the LLM's probability for the positive class.
- Base ML Estimator: A classical ML estimator (e.g., logistic regression) is trained on features and ground-truth labels, producing an output $f(x)$.
- DataComp-LM Layer: The outputs $s(x)$ and $f(x)$ are combined via one of three strategies: linear or adaptive (piecewise-constant) weighting, a calibration layer (multi-accuracy), or pseudo-label transfer for covariate-shifted test distributions.
The overall model is flexible, operating as a wrapper around arbitrary classical estimators and LLM prompts without retraining the LLM component.
2. Mathematical Formulation
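Let $x \in \mathbb{R}^d$ denote the feature vector, $s(x) \in [0, 1]$ the LLM score, and $f(x)$ the base ML output. Three principal combination strategies are defined:
- Linear Ensemble:
$$\hat{y}(x) = \alpha\, f(x) + (1 - \alpha)\, s(x), \qquad \alpha \in [0, 1].$$
The ensemble weight $\alpha$ is selected via cross-validation to minimize empirical loss:
$$\hat{\alpha} = \arg\min_{\alpha \in [0, 1]} \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i,\ \alpha f(x_i) + (1 - \alpha) s(x_i)\big).$$
- Adaptive-Weight (AdaLinear) Ensemble:
The range $[0, 1]$ of $s(x)$ is partitioned into $K$ bins $I_1, \dots, I_K$, and a piecewise-constant function $\alpha(\cdot)$ is learned:
$$\alpha(s) = \sum_{k=1}^{K} \alpha_k\, \mathbf{1}\{s \in I_k\}, \qquad \hat{y}(x) = \alpha\big(s(x)\big) f(x) + \big(1 - \alpha(s(x))\big) s(x).$$
Each bin weight $\alpha_k$ is optimized independently:
$$\hat{\alpha}_k = \arg\min_{\alpha_k \in [0, 1]} \sum_{i:\, s(x_i) \in I_k} \ell\big(y_i,\ \alpha_k f(x_i) + (1 - \alpha_k) s(x_i)\big).$$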
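- Calibration via Multi-Accuracy:
  - Naive Multicalibration: $\hat{y}(x) = s(x) + \Delta_k$ for $x$ in group $G_k$, where $s(x)$ is the LLM score and $\Delta_k$ is the empirical mean residual on $G_k$.
  - Group-wise Multicalibration: the correction assigns one weight to each cell of a grid, with the weights fitted via least squares.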
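The linear and adaptive-weight ensembles described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the loss is taken to be binary cross-entropy, the weight is found by a simple grid search, and the bins are assumed to be equal-width intervals of the LLM score. In practice the inputs `y`, `f`, and `s` would come from held-out folds, in line with the cross-validation step.

```python
import math

def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for one example; p is clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def fit_alpha(y, f, s, grid_size=101):
    """Grid-search alpha in [0, 1] minimizing the mean log loss of
    alpha * f + (1 - alpha) * s (assumed loss and optimizer)."""
    best_alpha, best_loss = 0.0, float("inf")
    for g in range(grid_size):
        alpha = g / (grid_size - 1)
        loss = sum(log_loss(yi, alpha * fi + (1 - alpha) * si)
                   for yi, fi, si in zip(y, f, s)) / len(y)
        if loss < best_loss:
            best_alpha, best_loss = alpha, loss
    return best_alpha

def bin_index(score, n_bins):
    """Map a score in [0, 1] to one of n_bins equal-width bins (assumed binning)."""
    return min(int(score * n_bins), n_bins - 1)

def fit_adaptive_alphas(y, f, s, n_bins=4, default=0.5):
    """Fit one weight per LLM-score bin; empty bins fall back to `default`."""
    alphas = []
    for k in range(n_bins):
        idx = [i for i, si in enumerate(s) if bin_index(si, n_bins) == k]
        if idx:
            alphas.append(fit_alpha([y[i] for i in idx],
                                    [f[i] for i in idx],
                                    [s[i] for i in idx]))
        else:
            alphas.append(default)
    return alphas

def ensemble_score(f_x, s_x, alphas):
    """Combine one base-model output and one LLM score using the bin's weight."""
    alpha = alphas[bin_index(s_x, len(alphas))]
    return alpha * f_x + (1 - alpha) * s_x
```

With `n_bins=1` the adaptive fit reduces to the single global weight of the linear ensemble, which is why one `ensemble_score` function serves both variants.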
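Transfer Learning under Covariate Shift:
When the training distribution $P$ and the target distribution $Q$ diverge, auxiliary samples $\tilde{x}_1, \dots, \tilde{x}_m \sim Q$ are drawn, and LLM pseudo-labels $\tilde{s}_j = s(\tilde{x}_j)$ are collected, where $s$ is the LLM score. The objective is to retrain the base estimator $f$ to jointly minimize losses on original and synthetic (LLM-labeled) data:
$$\min_{f}\ \sum_{i=1}^{n} \ell\big(y_i, f(x_i)\big) + \sum_{j=1}^{m} \tilde{\ell}\big(\tilde{s}_j, f(\tilde{x}_j)\big).$$
Here, $\tilde{\ell}$ is a weak-supervision loss to discount potential LLM label noise.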
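As a rough illustration of the mixed objective, the sketch below retrains a small logistic regression on ground-truth labels together with LLM pseudo-labels. It rests on assumptions that the source does not confirm: the weak-supervision loss is taken to be cross-entropy against the soft LLM score, both terms are weighted equally, and plain batch gradient descent is used. The names `fit_mixed` and `predict_proba` are hypothetical.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_mixed(X, y, X_aux, s_aux, lr=0.5, epochs=500):
    """Logistic regression on labeled data plus LLM pseudo-labeled data.

    Assumption: both loss terms are cross-entropy, so each example
    contributes the gradient (p - target) * x, where the target is a hard
    label for original data and a soft LLM score for auxiliary data.
    """
    rows = list(X) + list(X_aux)
    targets = list(y) + list(s_aux)   # hard labels first, then soft pseudo-labels
    dim, n = len(rows[0]), len(rows)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, t in zip(rows, targets):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - t
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Positive-class probability of the retrained model for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
```

For example, `fit_mixed([[0.0], [0.2], [0.8], [1.0]], [0, 0, 1, 1], [[2.0], [3.0]], [0.99, 0.99])` fits a model whose training inputs lie in $[0, 1]$ while the pseudo-labeled inputs lie further out, mimicking a shifted target region.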
3. Training Workflow
The standardized procedure for fitting DataComp-LM includes:
- Feature Embedding: Precompute embeddings $x_i$ for all data, collecting the corresponding ground-truth labels $y_i$.
- LLM Querying: For each input, issue the prompt to the frozen LLM and record the score $s(x_i)$.
- Base Model Training: Fit the base estimator $f$ to the pairs $(x_i, y_i)$.
- DataComp-LM Layer Fit: Select and fit the combination/calibration strategy:
  - For ensembles, the weights (the global ensemble weight $\alpha$ or the per-bin weights $\alpha_k$) are fit by empirical loss minimization.
  - For calibration, conditional residuals or group-wise multicalibration weights are computed.
  - For transfer, augment with $m$ LLM-labeled samples and retrain using mixed supervision.
- No LLM Fine-tuning: The LLM remains frozen throughout; parameter optimization is over the combination/calibration layer only.
4. Inference Process
At inference, fresh inputs (query–product pairs) are processed by:
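- Feature Extraction: Apply the embedding to obtain the feature vector $x$.
- LLM Scoring: Prompt the frozen LLM and collect the score $s(x)$.
- Base Prediction: Compute the base estimator output $f(x)$.
- Combination/Calibration: Use the trained DataComp-LM rule to combine $s(x)$ and $f(x)$, yielding the final score $\hat{y}(x)$.
- Decision Rule: Apply a fixed threshold (typically $0.5$ for probability-valued scores) to $\hat{y}(x)$ for binary outputs.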
This procedure draws on two parallel prediction sources that are obtained independently: the score from the frozen LLM and the output of the trained base estimator.
5. Addressing Distributional Shift
DataComp-LM addresses covariate shift by augmenting training with pseudo-labeled data sampled from the shifted distribution. As relabeling all samples from the target distribution $Q$ is assumed infeasible, the method employs the LLM as a pseudo-labeler to generate scores $\tilde{s}_j$ for additional data drawn from $Q$. The training objective mixes the original supervised loss with a relaxed loss on the LLM-labeled data, so that empirical risk on the labeled sample is balanced against a regularizing effect from the pseudo-labels. Empirically, the weak-supervision loss $\tilde{\ell}$ can be taken to be the supervised loss $\ell$ itself, and the number $m$ of pseudo-labeled samples is chosen so that the reweighted training mix approximates the test covariate distribution (Wu et al., 2024).
6. Implementation Details and Pseudocode
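The implementation composes the training workflow (Section 3) with the inference process (Section 4): embed each input, query the frozen LLM, fit the base estimator, fit the combination or calibration layer, and threshold the combined score at inference.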
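The overall flow can be sketched as a thin wrapper in Python. This is an illustrative outline under stated assumptions rather than the reference implementation: `fit_pipeline` and all four callables it receives are hypothetical stand-ins, and the default threshold of 0.5 is an assumption.

```python
def fit_pipeline(pairs, labels, embed, llm_score, fit_base, fit_layer):
    """Fit the wrapper; the LLM is only queried, never updated.

    All four callables are hypothetical stand-ins supplied by the caller:
      embed(pair)                 -> feature vector
      llm_score(pair)             -> float in [0, 1] from the frozen LLM
      fit_base(features, labels)  -> callable: feature vector -> score
      fit_layer(labels, base_scores, llm_scores)
                                  -> callable: (f_x, s_x) -> combined score
    """
    features = [embed(p) for p in pairs]                   # feature embedding
    llm_scores = [llm_score(p) for p in pairs]             # frozen-LLM scores
    base_model = fit_base(features, labels)                # base estimator
    base_scores = [base_model(x) for x in features]
    combine = fit_layer(labels, base_scores, llm_scores)   # combination layer

    def predict(pair, threshold=0.5):
        """Inference: embed, score with both models, combine, threshold."""
        f_x = base_model(embed(pair))
        s_x = llm_score(pair)
        y_hat = combine(f_x, s_x)
        return y_hat, int(y_hat >= threshold)

    return predict
```

For brevity the sketch fits the combination layer on the same examples used to train the base estimator; a cross-validated variant would pass out-of-fold base scores to `fit_layer` instead.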
7. Empirical Performance
DataComp-LM was validated on four public benchmarks: WANDS (Wayfair relevance), Yelp sentiment, Emotion classification, and Hate-speech detection. On every task, each reported DataComp-LM variant was more accurate than both the baseline ML estimator and the standalone LLM oracle, with the best variant gaining 1.3 to 4.3 percentage points over the better standalone model.
| Dataset | LLM | ML | Linear | AdaLinear | Calibration |
|---|---|---|---|---|---|
| WANDS | 77.5% | 80.3% | 83.8% | 84.6% | 84.0% |
| Yelp | 72.4% | 69.1% | 73.9% | 74.2% | 74.3% |
| Emotion | 75.9% | 79.9% | 80.6% | 81.2% | n/a |
| Hate | 67.2% | 71.7% | 72.0% | 72.4% | 73.1% |
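The margins in the table can be checked with a short script. The snippet below is only a convenience sketch: the accuracies are copied from the table above, and the helper name `margin_over_best_single` is hypothetical.

```python
# Accuracies (%) copied from the table above; None marks the missing entry.
results = {
    "WANDS":   {"LLM": 77.5, "ML": 80.3, "Linear": 83.8, "AdaLinear": 84.6, "Calibration": 84.0},
    "Yelp":    {"LLM": 72.4, "ML": 69.1, "Linear": 73.9, "AdaLinear": 74.2, "Calibration": 74.3},
    "Emotion": {"LLM": 75.9, "ML": 79.9, "Linear": 80.6, "AdaLinear": 81.2, "Calibration": None},
    "Hate":    {"LLM": 67.2, "ML": 71.7, "Linear": 72.0, "AdaLinear": 72.4, "Calibration": 73.1},
}

def margin_over_best_single(row):
    """Return the best combined variant and its gain (in points) over the
    better of the two standalone models."""
    single = max(row["LLM"], row["ML"])
    combined = {k: v for k, v in row.items()
                if k not in ("LLM", "ML") and v is not None}
    best = max(combined, key=combined.get)
    return best, round(combined[best] - single, 1)

for name, row in results.items():
    best, gain = margin_over_best_single(row)
    print(f"{name}: best variant = {best}, gain = {gain:+.1f} points")
```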
In a constructed covariate shift (Table $\to$ Bed relevance), ML accuracy dropped when moving from the Table set to the Bed set, while Transfer-LLM recovered accuracy on the Bed set with only a small loss on Table. This result indicates that DataComp-LM remains robust under substantial distribution shift (Wu et al., 2024).
A plausible implication is that DataComp-LM provides a general recipe for deploying LLM-augmented classifiers in settings where retraining or relabeling is infeasible, using the LLM for soft supervision, ensembling, or calibration to improve accuracy and stability over either component alone.