Weak Supervision with Click Data
- Weak supervision with click-through data is a technique that converts noisy user signals—like clicks and dwell time—into proxy labels for training relevance models.
- It employs systematic data extraction, bias correction, and generative weak label aggregation to tackle challenges such as label noise and scalability.
- Deployments in production systems show improved NDCG and AUC and reduced latency, leveraging modern model architectures such as transformers and MASM encoders.
Weak supervision with click-through data encompasses a set of methodologies that extract noisy relevance signals from large-scale user interaction logs—primarily clicks, dwell time, skips, and related engagement metrics—to synthesize supervised training datasets for applications such as search ranking, product relevance modeling, and query intent classification. These approaches address fundamental challenges of label noise, bias, and scalability, leveraging both programmatic heuristics and statistical generative models to calibrate and refine implicit supervision signals. When paired with model architectures designed to accommodate uncertain or probabilistic labels, weak supervision pipelines can markedly enhance downstream learning-to-rank precision, AUC, and real-world production metrics, often outperforming naïve baselines or even limited fully-supervised approaches.
1. Data Extraction and Preprocessing from Click Logs
Click-through data consists of logs of (query, document/product, clicks, downstream actions) tuples accumulated from user interactions. Extraction protocols generally include filtering (e.g., bot and ephemeral session removal), sessionization, and engagement signal computation. Representative signals per query–document or query–product pair include raw click count, dwell time between consecutive actions, explicit skip/dismiss flags, and subsequent conversion events such as apply/save or transactions (Vasudevan, 10 Mar 2025, Yao et al., 2021).
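As an illustration, the sessionization and engagement-signal steps might be sketched as follows; the field names, the 30-minute session gap, and the click/skip action taxonomy are illustrative assumptions, not details from the cited systems:

```python
from collections import defaultdict

def sessionize(events, gap_seconds=1800):
    """Split a user's time-ordered events into sessions: a new session
    starts whenever the gap between consecutive events exceeds the threshold."""
    sessions, current, last_ts = [], [], None
    for ev in events:  # each ev: {"ts": ..., "query": ..., "doc": ..., "action": ...}
        if last_ts is not None and ev["ts"] - last_ts > gap_seconds:
            sessions.append(current)
            current = []
        current.append(ev)
        last_ts = ev["ts"]
    if current:
        sessions.append(current)
    return sessions

def engagement_signals(session):
    """Aggregate per (query, doc) signals: click count, dwell time
    (approximated as the gap to the next event), and explicit skips."""
    signals = defaultdict(lambda: {"clicks": 0, "dwell": 0.0, "skips": 0})
    for ev, nxt in zip(session, session[1:] + [None]):
        key = (ev["query"], ev["doc"])
        if ev["action"] == "click":
            signals[key]["clicks"] += 1
            if nxt is not None:
                signals[key]["dwell"] += nxt["ts"] - ev["ts"]
        elif ev["action"] == "skip":
            signals[key]["skips"] += 1
    return dict(signals)
```

Bot filtering would precede this step in practice; dwell approximated from inter-event gaps is a common but noisy proxy, which is precisely why the downstream labels are treated as weak.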
Hand-tuned heuristics map engagement signals to initial pointwise labels; for example, apply actions may map to a high score, a long dwell after a click to a medium score, and dismissals to a low score. Position bias, a key confounding factor, is typically alleviated via bias-correction schemes that compare calibrated CTRs across rank positions to estimate intrinsic relevance probabilities (Yao et al., 2021). Hard negatives for training, such as query-rewrite adversarial examples, can be generated systematically to focus modeling capacity on semantic mismatch rather than merchandising or superficial factors.
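A minimal sketch of both steps, with illustrative label thresholds and an inverse-propensity form of position-bias correction; the specific scores and propensity estimates used by the cited systems differ:

```python
def pointwise_label(signals):
    """Map raw engagement signals to a heuristic pointwise label.
    Thresholds and scores here are illustrative, not from any cited system."""
    if signals.get("apply", 0) > 0:
        return 1.0            # strong positive: downstream conversion
    if signals.get("clicks", 0) > 0 and signals.get("dwell", 0.0) >= 30.0:
        return 0.6            # medium: long dwell after click
    if signals.get("skips", 0) > 0:
        return 0.0            # negative: explicit skip/dismiss
    return 0.3                # default weak prior

def position_debiased_ctr(clicks_by_pos, impressions_by_pos, propensity_by_pos):
    """Inverse-propensity-weighted CTR: divide per-position clicks by the
    estimated examination propensity at that rank position."""
    num, den = 0.0, 0.0
    for pos, imp in impressions_by_pos.items():
        num += clicks_by_pos.get(pos, 0) / propensity_by_pos[pos]
        den += imp
    return num / den if den else 0.0
```

Items shown low on the page are examined less often, so their raw clicks are divided by a small propensity, boosting their corrected CTR relative to top-ranked items.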
2. Weak Label Generation via Heuristics and Generative Models
To mitigate click noise, pipelines employ several levels of weak labeling functions (LFs), each encoding domain heuristics, SME rules, or statistical priors. LFs may abstain, deliver binary or multi-class votes, and operate over token matching, industry alignment, seniority window, site-list membership, or click patterns (Vasudevan, 10 Mar 2025, Alexander et al., 2022). Aggregation of LF outputs is performed using generative models such as Snorkel's label model or custom exponential family constructions (Universal-WS) (Shin et al., 2021). These models estimate per-LF accuracies from small "golden" seed sets, assuming conditional independence or, for advanced applications, capturing LF correlations.
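A toy version of this pipeline, with two hypothetical LFs that can abstain and a log-odds fusion under the conditional-independence assumption; this is a simplification of a Snorkel-style label model, not Snorkel's actual implementation, and the LF rules are invented for illustration:

```python
import math

ABSTAIN, IRRELEVANT, RELEVANT = -1, 0, 1

def lf_token_overlap(x):
    """Vote RELEVANT when at least half the query tokens appear in the title."""
    q, t = set(x["query"].split()), set(x["title"].split())
    if not q:
        return ABSTAIN
    return RELEVANT if len(q & t) / len(q) >= 0.5 else IRRELEVANT

def lf_click_pattern(x):
    """Vote from click behavior; abstain when there is no click signal."""
    if x.get("clicks", 0) == 0:
        return ABSTAIN
    return RELEVANT if x.get("dwell", 0.0) >= 30.0 else IRRELEVANT

def aggregate(votes, accuracies):
    """Accuracy-weighted log-odds fusion assuming conditional independence:
    each non-abstaining LF j contributes log(a_j / (1 - a_j)) toward its vote.
    Accuracies would be estimated from a small golden seed set."""
    score = 0.0
    for v, a in zip(votes, accuracies):
        if v == ABSTAIN:
            continue
        w = math.log(a / (1.0 - a))
        score += w if v == RELEVANT else -w
    return 1.0 / (1.0 + math.exp(-score))  # P(relevant | votes)
```

When every LF abstains the posterior falls back to 0.5, making the "no evidence" case explicit rather than silently labeling it.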
A typical generative formulation for binary relevance is

$$P_\theta(y, \lambda_1, \dots, \lambda_m) \propto \exp\Big(\sum_{j=1}^{m} \theta_j \,\mathbf{1}[\lambda_j = y]\Big),$$

where $\lambda_j$ is the output of LF $j$ for a data point, and the posterior $P_\theta(y = \text{irrelevant} \mid \lambda_1, \dots, \lambda_m)$ is interpreted as the weak irrelevance score.
For ranking, each click session can be framed as a noisy labeling function emitting a partial permutation $\pi$, modeled by a (possibly heterogeneous) Mallows model around the true relevance permutation $\sigma$ via the Kendall–$\tau$ distance (Shin et al., 2021):

$$P(\pi \mid \sigma) \propto \exp\big(-\theta \, d_\tau(\pi, \sigma)\big).$$

Universal-WS recovers unknown LF accuracies via method-of-moments (MoM) estimators on pairwise correlations of embedding vectors of the LF outputs.
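The Kendall–τ distance underlying the Mallows model can be computed directly; the sketch below uses a simple O(n²) pairwise count for clarity and also shows the unnormalized Mallows weight:

```python
import math
from itertools import combinations

def kendall_tau(pi, sigma):
    """Kendall-tau distance: the number of item pairs ordered differently
    in the two permutations (each given as an item list, best rank first)."""
    pos_pi = {item: i for i, item in enumerate(pi)}
    pos_sigma = {item: i for i, item in enumerate(sigma)}
    return sum(
        1
        for a, b in combinations(pi, 2)
        if (pos_pi[a] - pos_pi[b]) * (pos_sigma[a] - pos_sigma[b]) < 0
    )

def mallows_weight(pi, sigma, theta=1.0):
    """Unnormalized Mallows probability exp(-theta * d_tau(pi, sigma)):
    permutations closer to the central ranking get exponentially more mass."""
    return math.exp(-theta * kendall_tau(pi, sigma))
```

For production-scale lists an O(n log n) merge-sort-based count would replace the pairwise loop.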
LF aggregation may utilize majority voting, generative probabilistic fusion, or weighted inference (e.g., weighted Kemeny aggregation in Universal-WS).
3. Model Architectures and Loss Function Design
Learning frameworks are generally deep neural networks optimized in listwise, pairwise, or pointwise fashion. Architectures ingest embedded representations, e.g., multi-aspect self-attention (MASM) encoders for product title/query tokens (Yao et al., 2021), multi-layer MLPs for preference or ranking scores (Dehghani et al., 2017), or transformer-based classifiers for intent (Alexander et al., 2022). For weak labels, the loss function is refactored to incorporate probabilistic or soft labels, e.g. a weighted cross-entropy of the form

$$\mathcal{L} = -\big[(1 - \tilde{p}) \log s + \alpha \, \tilde{p} \log(1 - s)\big],$$

where $s$ is the model's relevance score, $\tilde{p}$ the weak "irrelevance" probability, and $\alpha$ a penalization score for suspected false positives (Vasudevan, 10 Mar 2025).
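A plain-Python sketch of such a soft-label loss, assuming a weighted cross-entropy form; the weighting scheme and the `penalty` parameter are illustrative, not the exact loss of the cited work:

```python
import math

def soft_label_bce(score, p_irrelevant, penalty=1.0, eps=1e-7):
    """Pointwise cross-entropy against a soft target: the weak label supplies
    P(irrelevant), and suspected false positives (high model score paired with
    high weak irrelevance) are up-weighted by `penalty`."""
    s = min(max(score, eps), 1.0 - eps)  # clamp for numerical stability
    p_relevant = 1.0 - p_irrelevant
    return -(p_relevant * math.log(s)
             + penalty * p_irrelevant * math.log(1.0 - s))
```

Unlike hard 0/1 targets, the soft target lets uncertain weak labels exert proportionally weaker gradients, which is the point of carrying label-model posteriors into training.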
Controlled weak supervision, as in Dehghani et al., proceeds via a joint two-network architecture: a target network optimized on weakly labeled data, and a confidence network trained to weight the gradients according to observed agreement between weak and true labels (Dehghani et al., 2017). The resulting per-example weighting factor pushes the model to rely less on noisy pairs.
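The gradient-weighting idea can be sketched independently of any framework; here per-example confidences simply rescale gradients before averaging, a simplification of the confidence-network mechanism (which learns those confidences jointly):

```python
def confidence_weighted_step(params, grads, confidences, lr=0.1):
    """One SGD step where each example's gradient vector is scaled by a
    confidence weight in [0, 1]: low-confidence (likely noisy) weak labels
    contribute less to the parameter update."""
    update = [0.0] * len(params)
    total_c = sum(confidences) or 1.0
    for g, c in zip(grads, confidences):   # grads: per-example gradient vectors
        for i in range(len(params)):
            update[i] += c * g[i]
    return [p - lr * u / total_c for p, u in zip(params, update)]
```

In the full scheme the confidences would come from a second network scoring weak-vs-true label agreement rather than being supplied by hand.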
For fine-grained confidence, positive and negative samples derived from clicks are bucketed into ranks (strong/relevant/weak positive, weak/strong negative), with corresponding "soft" target thresholds guiding loss function calculation. This calibration yields robust score distributions with well-separated confidence bands (Yao et al., 2021).
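The bucketing scheme might look like the following; the bucket names, thresholds, and soft targets are hypothetical stand-ins for the calibrated bands described above:

```python
def bucket_sample(clicks, dwell, purchases, skips):
    """Assign a click-derived sample to a confidence bucket and a soft
    target score; thresholds are illustrative, not from the cited system."""
    if purchases > 0:
        return "strong_positive", 0.95
    if clicks > 0 and dwell >= 30.0:
        return "relevant_positive", 0.80
    if clicks > 0:
        return "weak_positive", 0.60
    if skips > 0:
        return "strong_negative", 0.05
    return "weak_negative", 0.30
```

Training against these graded targets, rather than binary labels, is what produces the well-separated score bands at inference time.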
4. Empirical Evaluation and Production Metrics
Systems are evaluated using offline metrics such as NDCG@k, MAP@k, AUC (ROC), and PR-AUC, computed over held-out sets matched to weak or true labels. Illustrative results include:
- NDCG@10 improvements of +34% to +42% on weak labels with minimal degradation on raw engagement (Vasudevan, 10 Mar 2025).
- AUC/ROC jumps of 0.12 absolute and PR-AUC gains >0.15 over pairwise click baselines in e-commerce product relevance (Yao et al., 2021).
- In Universal-WS, weighted Kemeny pseudolabels deliver +0.036 NDCG@3 over majority-vote aggregation, and can outperform fully supervised models trained on an order-of-magnitude fewer human labels (Shin et al., 2021).
- Large-scale deployment (e.g., Taobao) demonstrates GMV +0.55% and annotation-based relevance uplift, running at 8 ms latency per request due to offline precomputation (Yao et al., 2021).
In intent classification, ORCAS-I achieves top-level intent accuracy of 0.902 and macro-F₁ of 0.822 via Snorkel-based rule aggregation, surpassing earlier rule-based and SVM baselines (see Table below) (Alexander et al., 2022).
| Study | Dataset | Method | Accuracy |
|---|---|---|---|
| Kathuria et al. ’10 | Dogpile log | k-means | 0.94 |
| Figueroa ’15 | AOL log | MaxEnt (q+u) | 0.822 |
| This paper | ORCAS | Snorkel rules | 0.902 |
5. Practical Deployment, Efficiency, and Scalability
Inference times for weak labeling approaches are compatible with large-scale web or e-commerce systems given batched precomputation, vectorized feature extraction, and simple LFs. ORCAS-I achieves per-query labeling in milliseconds on commodity hardware and scales to 18 million pairs in under 2 hours via embarrassingly parallel data partitioning (Alexander et al., 2022). Representation-based MASM encoders sharply reduce online latency per query by moving aspect-vector computation into an offline precomputation step (Yao et al., 2021).
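The precomputation pattern can be sketched generically: item vectors are encoded and normalized offline, so online scoring reduces to a query encode plus dot products. This is a generic representation-based setup, not the MASM implementation:

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

class PrecomputedScorer:
    """Representation-based scoring: item vectors are encoded offline once,
    so the only online work per query is one encode plus dot products."""

    def __init__(self, item_vectors):
        # Offline step: normalize and cache all item representations.
        self.items = {k: l2_normalize(v) for k, v in item_vectors.items()}

    def score(self, query_vector, top_k=10):
        # Online step: cosine similarity against cached vectors.
        q = l2_normalize(query_vector)
        scored = [
            (item, sum(a * b for a, b in zip(q, v)))
            for item, v in self.items.items()
        ]
        scored.sort(key=lambda t: -t[1])
        return scored[:top_k]
```

The trade-off versus interaction-based (cross-encoder) models is the usual one: precomputed representations lose query-conditioned item encoding but make millisecond-scale serving feasible.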
A plausible implication is that weak supervision frameworks are robust against catastrophic noise and can yield well-calibrated scoring distributions if position bias and adversarial negative generation are properly addressed (Yao et al., 2021).
6. Limitations, Biases, and Future Directions
Principal limitations arise from click bias (positional, selection, stopping/cascade effects), idiosyncratic promotion or presentation artifacts, and dependence on the accuracy of position-bias calibration or negative sampling. UI affordances, such as dismiss/skip, may not cleanly map to semantic relevance. To mitigate these issues, avenues for future work include:
- Expanding LF sets to include online-servable heuristics and large-language-model-generated functions (Vasudevan, 10 Mar 2025).
- Incorporating more expressive generative models or learning LF dependencies beyond independence assumptions.
- Dynamic thresholding or adaptive bucketing of confidence levels per query cluster (Yao et al., 2021).
- Universal weak supervision approaches to support arbitrary label types, including regression, hyperbolic labels, and multiway taxonomies without redesigning synthesis algorithms (Shin et al., 2021).
- Employing LLMs as “judges” to scale annotation capacity or refine LF rules.
This suggests a broader movement towards integrating weakly supervised pseudo-labeling from click logs with high-capacity neural architectures and generative calibration, offering scalable alternatives to manual annotation for information retrieval and e-commerce relevance modeling.