
DataComp-LM Classifier

Updated 24 January 2026
  • DataComp-LM Classifier is a modular ensemble framework that combines classical ML predictions with frozen LLM scores for robust binary classification.
  • It employs linear, adaptive weighting, and multi-accuracy calibration strategies to optimize performance under both in-distribution and distribution-shifted conditions.
  • Empirical results on benchmarks demonstrate improved accuracy and enhanced stability over standalone ML and LLM approaches, especially under covariate shift.

The DataComp-LM classifier is a modular algorithmic framework for enhancing classical supervised learning estimators on binary classification tasks by integrating predictions from pre-trained, frozen LLMs. DataComp-LM functions as an ensemble and calibration layer, systematically combining classical ML estimators with LLM inference to provide robust improvements under both in-distribution and distribution-shifted regimes. The architecture never fine-tunes the LLM, instead querying it as an oracle for scalar scores that function as soft labels or pseudo-labels (Wu et al., 2024).

1. System Architecture

Each input is a text pair, typically (query, product), which is mapped to a $d$-dimensional feature vector $x \in \mathbb{R}^d$ via embedding. The system operates in three stages:

  • LLM Oracle: For each input, a prompt is constructed and passed to a frozen pre-trained LLM, e.g., GPT-3.5-Turbo-Instruct, yielding a scalar score $z \in [0,1]$ interpreted as the LLM’s probability for the positive class.
  • Base ML Estimator: A classical ML estimator $f_n: \mathbb{R}^d \rightarrow [0,1]$ (e.g., logistic regression) is trained on features and ground-truth labels.
  • DataComp-LM Layer: The outputs $f_n(x)$ and $z$ are combined via one of three strategies: linear or adaptive (piecewise-constant) weighting, a calibration layer (multi-accuracy), or pseudo-label transfer for covariate-shifted test distributions.

The overall model is flexible, operating as a wrapper around arbitrary classical estimators and LLM prompts without retraining the LLM component.
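The oracle stage can be illustrated with a minimal Python sketch. The prompt template, the function names `build_prompt` and `parse_score`, and the fallback value 0.5 for unparseable replies are all illustrative assumptions, not details taken from Wu et al. (2024); the call to the LLM itself is left out.

```python
import re


def build_prompt(query: str, product: str) -> str:
    """Hypothetical prompt template asking the frozen LLM for a probability."""
    return (
        "On a scale from 0 to 1, how relevant is the product to the query?\n"
        f"Query: {query}\n"
        f"Product: {product}\n"
        "Answer with a single number."
    )


def parse_score(reply: str, fallback: float = 0.5) -> float:
    """Extract the first number in the LLM reply and clip it to [0, 1]."""
    match = re.search(r"[-+]?\d*\.?\d+", reply)
    if match is None:
        return fallback  # assumed neutral score when no number is found
    return min(1.0, max(0.0, float(match.group())))


prompt = build_prompt("oak dining table", "Solid oak table, seats six")
z = parse_score("Score: 0.83")  # stands in for the reply of a real LLM call
```

Clipping keeps the parsed value a valid score $z \in [0,1]$ even when the model replies with an out-of-range number.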

2. Mathematical Formulation

Let $x \in \mathbb{R}^d$ denote the feature vector, $z = \phi_{LLM}(\text{raw input}) \in [0,1]$ denote the LLM score, and $f_n(x) \in [0,1]$ the base ML output. Three principal combination strategies are defined:

  • Linear Ensemble:

$$\hat{y}^{Linear}(x) = \alpha \cdot f_n(x) + (1-\alpha) \cdot z$$

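The ensemble weight $\alpha$ is selected via cross-validation to minimize empirical loss:

$$\hat{\alpha} = \arg\min_{\alpha} \sum_{i} \ell\big(\alpha \cdot f_n(x_i) + (1-\alpha) \cdot z_i,\; y_i\big)$$

where $\ell$ denotes the per-sample loss and the sum runs over held-out samples $(x_i, z_i, y_i)$.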
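As a rough illustration of the weight selection, the sketch below picks $\alpha$ by grid search on a single held-out split. The log-loss, the grid resolution, the single split (instead of full cross-validation), and the toy scores are assumptions made for the example.

```python
import math


def log_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one prediction p in [0, 1] and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def linear_ensemble(f: float, z: float, alpha: float) -> float:
    """Convex combination of base ML output f and LLM score z."""
    return alpha * f + (1.0 - alpha) * z


def mean_loss(alpha, f_scores, z_scores, labels):
    """Average held-out loss of the linear ensemble for a given alpha."""
    total = sum(
        log_loss(linear_ensemble(f, z, alpha), y)
        for f, z, y in zip(f_scores, z_scores, labels)
    )
    return total / len(labels)


def select_alpha(f_scores, z_scores, labels, grid_size=101):
    """Grid search over alpha in [0, 1] (grid resolution is an assumption)."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(grid, key=lambda a: mean_loss(a, f_scores, z_scores, labels))


# Toy held-out scores, invented for illustration only.
f_val = [0.9, 0.2, 0.6, 0.4, 0.8, 0.3]
z_val = [0.7, 0.4, 0.9, 0.1, 0.5, 0.6]
y_val = [1, 0, 1, 0, 1, 0]
alpha_hat = select_alpha(f_val, z_val, y_val)
```

Because the grid contains both endpoints, the selected ensemble is never worse on the held-out split than using the base model alone ($\alpha = 1$) or the LLM alone ($\alpha = 0$).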

  • Adaptive-Weight (AdaLinear) Ensemble:

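The range $[0,1]$ of the LLM score $z$ is partitioned into $K$ bins $I_1, \dots, I_K$, and a piecewise-constant function $\alpha(z)$ is learned:

$$\hat{y}^{AdaLinear}(x) = \alpha(z) \cdot f_n(x) + (1-\alpha(z)) \cdot z, \qquad \alpha(z) = \alpha_k \ \text{for } z \in I_k$$

Each bin weight $\alpha_k$ is optimized independently:

$$\hat{\alpha}_k = \arg\min_{\alpha} \sum_{i:\, z_i \in I_k} \ell\big(\alpha \cdot f_n(x_i) + (1-\alpha) \cdot z_i,\; y_i\big)$$

with $\ell$ the same per-sample loss used for the linear ensemble.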
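The per-bin fit can be sketched as follows, under stated assumptions: bins are equal-width intervals over the LLM score $z$, each weight is found by grid search with log-loss, and empty bins fall back to a default weight of 0.5. None of these choices is confirmed by the source; they only make the piecewise-constant idea concrete.

```python
import math


def log_loss(p: float, y: int, eps: float = 1e-12) -> float:
    """Binary cross-entropy for one prediction p in [0, 1] and label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def bin_index(z: float, num_bins: int) -> int:
    """Index of the equal-width bin of [0, 1] containing z (assumed binning)."""
    return min(int(z * num_bins), num_bins - 1)


def fit_bin_weights(f_scores, z_scores, labels, num_bins=4, grid_size=21, default=0.5):
    """Fit one weight per bin, each independently of the others."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    weights = []
    for k in range(num_bins):
        rows = [
            (f, z, y)
            for f, z, y in zip(f_scores, z_scores, labels)
            if bin_index(z, num_bins) == k
        ]
        if not rows:
            weights.append(default)  # assumed fallback for empty bins
            continue
        weights.append(
            min(grid, key=lambda a: sum(log_loss(a * f + (1 - a) * z, y) for f, z, y in rows))
        )
    return weights


def ada_linear(f: float, z: float, weights) -> float:
    """Combine f and z with the weight of the bin that z falls into."""
    a = weights[bin_index(z, len(weights))]
    return a * f + (1.0 - a) * z


# Toy scores, invented for illustration; no z value falls into the third bin.
f_tr = [0.3, 0.1, 0.7, 0.2, 0.6, 0.9, 0.4, 0.8]
z_tr = [0.05, 0.1, 0.3, 0.4, 0.8, 0.9, 0.95, 0.85]
y_tr = [0, 0, 1, 0, 1, 1, 1, 1]
weights = fit_bin_weights(f_tr, z_tr, y_tr)
```

Fitting each bin separately lets the ensemble lean on the LLM where its scores are reliable and on the base model elsewhere, at the cost of needing enough samples per bin.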

  • Calibration via Multi-Accuracy:
    • Naive Multicalibration: the prediction is adjusted by a correction term equal to the empirical mean residual on the corresponding subset of samples.
    • Group-wise Multicalibration: a correction is defined for each grid cell, with the weights fitted via least squares.
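A minimal sketch of the naive variant, assuming the subsets are equal-width bins of the LLM score $z$ and the residual is $y - f_n(x)$; both assumptions, along with the final clipping to $[0,1]$, are illustrative and not confirmed by the source.

```python
def bin_index(score: float, num_bins: int) -> int:
    """Index of the equal-width bin of [0, 1] containing the score."""
    return min(int(score * num_bins), num_bins - 1)


def fit_mean_residuals(f_scores, z_scores, labels, num_bins=4):
    """Mean residual y - f per bin of z; bins without samples get 0."""
    sums = [0.0] * num_bins
    counts = [0] * num_bins
    for f, z, y in zip(f_scores, z_scores, labels):
        k = bin_index(z, num_bins)
        sums[k] += y - f
        counts[k] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def calibrate(f: float, z: float, residuals, clip: bool = True) -> float:
    """Shift the base prediction by the mean residual of z's bin."""
    corrected = f + residuals[bin_index(z, len(residuals))]
    return min(1.0, max(0.0, corrected)) if clip else corrected


# Toy scores, invented for illustration only.
f_cal = [0.3, 0.1, 0.7, 0.2, 0.6, 0.9, 0.4, 0.8]
z_cal = [0.05, 0.1, 0.3, 0.4, 0.8, 0.9, 0.95, 0.85]
y_cal = [0, 0, 1, 0, 1, 1, 1, 1]
residuals = fit_mean_residuals(f_cal, z_cal, y_cal)
```

By construction, the unclipped corrected predictions have zero mean residual within every bin of the fitting data, which is the property the calibration layer targets.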

  • Transfer Learning under Covariate Shift:

When the training and target covariate distributions diverge, auxiliary samples $\tilde{x}_j$ are drawn from the target distribution, and LLM pseudo-labels $\tilde{z}_j$ are collected. The objective is to retrain $f_n$ to jointly minimize losses on original and synthetic (LLM-labeled) data:

$$\min_{f} \; \sum_{i} \ell\big(f(x_i), y_i\big) + \sum_{j} \tilde{\ell}\big(f(\tilde{x}_j), \tilde{z}_j\big)$$

Here, $\tilde{\ell}$ is a weak-supervision loss to discount potential LLM label noise.
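The mixed objective can be sketched with a one-dimensional logistic model trained by gradient descent. Several choices here are assumptions for illustration: the weak-supervision loss is taken to be cross-entropy against the soft LLM score, scaled by a hypothetical discount `lam`; the data are invented; and the optimizer settings are arbitrary.

```python
import math


def sigmoid(u: float) -> float:
    return 1.0 / (1.0 + math.exp(-u))


def cross_entropy(p: float, t: float, eps: float = 1e-12) -> float:
    """Cross-entropy between prediction p and a hard or soft target t."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def objective(w, b, labeled, pseudo, lam):
    """Supervised loss plus lam-discounted loss on LLM pseudo-labels."""
    total = sum(cross_entropy(sigmoid(w * x + b), y) for x, y in labeled)
    total += lam * sum(cross_entropy(sigmoid(w * x + b), zt) for x, zt in pseudo)
    return total / (len(labeled) + lam * len(pseudo))


def fit_transfer(labeled, pseudo, lam=0.5, lr=0.1, steps=500):
    """Plain gradient descent on the mixed objective, starting from zero."""
    w, b = 0.0, 0.0
    norm = len(labeled) + lam * len(pseudo)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, t in labeled:
            r = sigmoid(w * x + b) - t
            grad_w += r * x
            grad_b += r
        for x, t in pseudo:
            r = lam * (sigmoid(w * x + b) - t)
            grad_w += r * x
            grad_b += r
        w -= lr * grad_w / norm
        b -= lr * grad_b / norm
    return w, b


# Toy data: labeled source points with x <= 0, pseudo-labeled target points with x >= 1.
labeled = [(-2.0, 0), (-1.5, 0), (-0.5, 1), (0.0, 1)]
pseudo = [(1.0, 0.8), (2.0, 0.9), (3.0, 0.7)]
w_hat, b_hat = fit_transfer(labeled, pseudo)
```

Setting `lam` below 1 is one simple way to discount noisy pseudo-labels; the source's actual form of $\tilde{\ell}$ may differ.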

3. Training Workflow

The standardized procedure for fitting DataComp-LM includes:

  1. Feature Embedding: Precompute embeddings $x_i \in \mathbb{R}^d$ for all data, collecting corresponding ground-truth labels $y_i$.

  2. LLM Querying: For each input, issue a prompt to the frozen LLM and record $z_i$.

  3. Base Model Training: Fit $f_n$ to $\{(x_i, y_i)\}$.

  4. DataComp-LM Layer Fit: Select and fit the combination/calibration strategy:

    • For ensembles, weights ($\alpha$ for the linear ensemble, per-bin $\alpha_k$ for the adaptive ensemble) are fit by empirical minimization.
    • For calibration, conditional residuals or group-wise multitask weights are computed.
    • For transfer, augment with LLM-labeled samples and retrain using mixed supervision.
  5. No LLM Fine-tuning: The LLM remains frozen throughout; parameter optimization is over the combination/calibration layer only.

4. Inference Process

At inference, fresh inputs (query–product pairs) are processed by:

  1. Feature Extraction: Apply embedding to obtain $x$.
  2. LLM Scoring: Prompt the frozen LLM and collect $z$.
  3. Base Prediction: Compute $f_n(x)$.
  4. Combination/Calibration: Use the trained DataComp-LM rule to combine $f_n(x)$ and $z$, yielding final $\hat{y}(x)$.
  5. Decision Rule: Apply a fixed threshold to $\hat{y}(x)$ for binary outputs.

This procedure implements robust decision-making by leveraging two parallel, independently trained sources of prediction.
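The five inference steps map onto a short function. In this sketch every callable (`embed`, `base_model`, `llm_score`, `combine`) is a hypothetical stand-in supplied by the caller, and the default threshold of 0.5 is an assumption.

```python
from typing import Callable, Sequence, Tuple


def predict_label(
    raw_input: Tuple[str, str],
    embed: Callable[[Tuple[str, str]], Sequence[float]],
    base_model: Callable[[Sequence[float]], float],
    llm_score: Callable[[Tuple[str, str]], float],
    combine: Callable[[float, float], float],
    threshold: float = 0.5,  # assumed cut-off
) -> Tuple[float, int]:
    x = embed(raw_input)        # 1. feature extraction
    z = llm_score(raw_input)    # 2. LLM scoring (frozen model)
    f = base_model(x)           # 3. base prediction
    y_hat = combine(f, z)       # 4. combination / calibration
    return y_hat, int(y_hat >= threshold)  # 5. decision rule


# Stub components, invented for illustration only.
y_hat, label = predict_label(
    ("oak dining table", "Solid oak table, seats six"),
    embed=lambda pair: [float(len(pair[0])), float(len(pair[1]))],
    base_model=lambda x: 0.8,
    llm_score=lambda pair: 0.6,
    combine=lambda f, z: 0.5 * f + 0.5 * z,
)
```

Passing the combination rule as a callable keeps the function agnostic to whether a linear, adaptive, or calibrated rule was fitted.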

5. Addressing Distributional Shift

DataComp-LM addresses covariate shift by augmenting training with pseudo-labeled data sampled from the shifted distribution. As relabeling all samples from the target distribution is assumed infeasible, the method employs the LLM as a pseudo-labeler to generate pseudo-labels for additional data drawn from that distribution. The training objective mixes the original supervised loss and a relaxed loss for LLM-labeled data, which balances empirical risk against a regularizing effect from the pseudo-labels. Empirically, the mix of original and pseudo-labeled data is chosen so that the reweighted training set approximates the test covariate distribution (Wu et al., 2024).

6. Implementation Details and Pseudocode

The implementation follows the training workflow of Section 3 and the inference process of Section 4.
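A minimal end-to-end sketch of the wrapper, restricted to the linear strategy, might look as follows. The class name `LinearEnsembleWrapper`, the `train_base` callback, the single validation split, and the toy components are all hypothetical; this is not the authors' pseudocode.

```python
import math


def log_loss(p, y, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


class LinearEnsembleWrapper:
    """Wraps a base-model trainer and a frozen LLM scorer (both caller-supplied)."""

    def __init__(self, embed, llm_score, train_base, grid_size=51):
        self.embed, self.llm_score, self.train_base = embed, llm_score, train_base
        self.grid = [k / (grid_size - 1) for k in range(grid_size)]

    def fit(self, train_inputs, train_labels, val_inputs, val_labels):
        # Steps 1 and 3: embed the training data and fit the base estimator.
        self.base = self.train_base([self.embed(r) for r in train_inputs], train_labels)
        # Step 2: query the frozen LLM; only validation scores are needed here.
        f_val = [self.base(self.embed(r)) for r in val_inputs]
        z_val = [self.llm_score(r) for r in val_inputs]

        # Step 4: choose alpha by held-out log-loss; the LLM itself is never updated.
        def loss(a):
            return sum(
                log_loss(a * f + (1 - a) * z, y)
                for f, z, y in zip(f_val, z_val, val_labels)
            )

        self.alpha = min(self.grid, key=loss)
        return self

    def score(self, raw_input):
        f, z = self.base(self.embed(raw_input)), self.llm_score(raw_input)
        return self.alpha * f + (1.0 - self.alpha) * z

    def predict(self, raw_input, threshold=0.5):
        return int(self.score(raw_input) >= threshold)


# Toy components, invented for illustration only.
def train_base(features, labels):
    rate = sum(labels) / len(labels)
    return lambda x: rate  # trivial base model: predicts the positive rate


wrapper = LinearEnsembleWrapper(
    embed=lambda text: [float(len(text))],
    llm_score=lambda text: 0.9 if "good" in text else 0.1,
    train_base=train_base,
).fit(
    ["good desk", "bad desk", "good lamp", "bad lamp"], [1, 0, 1, 0],
    ["good sofa", "bad sofa"], [1, 0],
)
```

In this toy setup the base model is uninformative, so the fitted weight shifts entirely to the LLM score; with a real base estimator the weight would typically land strictly between 0 and 1.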

7. Empirical Performance

DataComp-LM was validated on four public benchmarks: WANDS (Wayfair relevance), Yelp sentiment, Emotion classification, and Hate-speech detection. Across all tasks, DataComp-LM variants produced consistent accuracy gains over both baseline ML and standalone LLM oracles.

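| Dataset | LLM | ML | Linear | AdaLinear | Calibration |
|---------|-------|-------|--------|-----------|-------------|
| WANDS   | 77.5% | 80.3% | 83.8%  | 84.6%     | 84.0%       |
| Yelp    | 72.4% | 69.1% | 73.9%  | 74.2%     | 74.3%       |
| Emotion | 75.9% | 79.9% | 80.6%  | 81.2%     | n/a         |
| Hate    | 67.2% | 71.7% | 72.0%  | 72.4%     | 73.1%       |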
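To make the size of the gains explicit, the short script below computes, for each dataset, the margin of the best DataComp-LM variant over the stronger of the two baselines. The accuracies are copied from the reported results; the variable and function names are illustrative.

```python
# Reported accuracies in percent; None marks the missing Emotion calibration entry.
results = {
    "WANDS": {"LLM": 77.5, "ML": 80.3, "Linear": 83.8, "AdaLinear": 84.6, "Calibration": 84.0},
    "Yelp": {"LLM": 72.4, "ML": 69.1, "Linear": 73.9, "AdaLinear": 74.2, "Calibration": 74.3},
    "Emotion": {"LLM": 75.9, "ML": 79.9, "Linear": 80.6, "AdaLinear": 81.2, "Calibration": None},
    "Hate": {"LLM": 67.2, "ML": 71.7, "Linear": 72.0, "AdaLinear": 72.4, "Calibration": 73.1},
}
BASELINES = ("LLM", "ML")


def best_gain(row):
    """Best variant accuracy minus the stronger baseline, in percentage points."""
    baseline = max(row[name] for name in BASELINES)
    variants = [v for name, v in row.items() if name not in BASELINES and v is not None]
    return round(max(variants) - baseline, 1)


gains = {dataset: best_gain(row) for dataset, row in results.items()}
# gains == {"WANDS": 4.3, "Yelp": 1.9, "Emotion": 1.3, "Hate": 1.4}
```

The margins range from 1.3 points (Emotion) to 4.3 points (WANDS), and Yelp is the only dataset where the LLM is the stronger baseline.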

In a constructed covariate shift (Table $\to$ Bed relevance), ML accuracy dropped when moving from the Table set to the Bed set, but Transfer-LLM recovered accuracy on the Bed set with only a small loss on Table. This result suggests that DataComp-LM remains robust under substantial distribution shift (Wu et al., 2024).

A plausible implication is that DataComp-LM provides a general recipe for deploying LLM-augmented classifiers in settings where retraining or relabeling is infeasible, using LLMs for soft supervision, ensembling, or calibration to improve accuracy and stability over either component alone.

