Delayed Feedback Modeling for GMV Prediction

Updated 4 February 2026
  • The paper introduces advanced methodologies such as multi-head online learning, delay bucket decomposition, and repurchase-aware dual-branch modeling to tackle delayed feedback in GMV prediction.
  • It emphasizes the importance of streaming updates, auxiliary feature calibration, and instance weighting to address censored labels and drift in dynamic e-commerce environments.
  • Empirical results from benchmarks like Taobao and Criteo validate these approaches by nearly closing the gap to oracle performance while effectively controlling bias.

Delayed feedback modeling for GMV (Gross Merchandise Value) prediction addresses the problem of forecasting the cumulative monetary value of purchases resulting from a user’s click, under conditions where purchase events may be temporally distributed and observed only after variable, potentially long, delays. This challenge is fundamental in online advertising and e-commerce platforms, where GMV serves both as a key business metric and as an optimization target for auction and ranking systems. The delayed feedback setting necessitates techniques that can handle censored or partially observed labels in a data stream while maintaining model freshness and unbiasedness.

1. Problem Definition and Statistical Structure

In post-click GMV prediction, a click $c = \langle x, t^c, \mathcal{P}\rangle$ is characterized by a feature vector $x \in \mathbb{R}^d$, a click timestamp $t^c$, and a sequence of purchases $\mathcal{P} = \{(t_i^p, p_i)\}_{i=1}^N$ occurring within an attribution window $w_a$. The label of interest is the total GMV $y^* = \sum_{i=1}^N p_i$. However, due to delayed feedback, at any wall-clock time $t$, only $y^{(t)} = \sum_{t_i^p \le t} p_i$ is observable; future purchases remain censored. Unlike classical CVR tasks, which are single-label and binary, GMV prediction under delayed feedback involves multi-purchase, continuous outcomes that challenge off-the-shelf supervised learning approaches (Li et al., 28 Jan 2026).
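The censored-label structure can be made concrete with a small sketch. The `Click` schema and helper names below are illustrative, not taken from any of the cited papers; only the definitions of $y^{(t)}$ and $y^*$ follow the text above.

```python
from dataclasses import dataclass, field

@dataclass
class Click:
    """Toy schema: a click with its attributed purchase sequence."""
    t_click: float                                  # click timestamp (hours)
    purchases: list = field(default_factory=list)   # [(t_purchase, value), ...]

def observed_gmv(click: Click, t_now: float) -> float:
    """Partial label y^(t): purchase value observed by wall-clock time t_now."""
    return sum(v for t_p, v in click.purchases if t_p <= t_now)

def final_gmv(click: Click, w_a: float = 7 * 24) -> float:
    """Full label y*: total GMV inside the attribution window w_a (hours)."""
    return sum(v for t_p, v in click.purchases
               if click.t_click <= t_p <= click.t_click + w_a)

# A click with an immediate purchase and a delayed repurchase two days later.
c = Click(t_click=0.0, purchases=[(0.1, 30.0), (48.0, 70.0)])
print(observed_gmv(c, t_now=1.0))   # only the immediate purchase is visible: 30.0
print(final_gmv(c))                 # the mature label y*: 100.0
```

A model trained one hour after the click would see a target of 30.0 for a sample whose true label is 100.0, which is exactly the underestimation bias discussed below.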

2. Benchmarks, Data Challenges, and Label Dynamics

The TRACE benchmark (Li et al., 28 Jan 2026) exemplifies the full complexity of GMV delayed feedback. TRACE contains over 7 million clicks (attribution window $w_a = 7$ days) from Taobao ad logs, capturing the complete purchase sequence for each click. Notably, only about 40% of eventual GMV occurs immediately at click time, 60% within the first 24 hours, and the remainder accrues over several days. Repurchase (clicks with $\ge 2$ orders) constitutes 53.6% of all clicks and exhibits distinct, heavy-tailed GMV distributions relative to single-purchase samples—mandating distinct treatment for these subpopulations. Analysis within TRACE shows that (a) model freshness is critical (streaming online learning outperforms daily-batch offline retraining) and (b) partial-label bias is pronounced (models naively trained on $y^{(t)}$ systematically underestimate $y^*$).

3. Core Modeling Paradigms

Delayed feedback GMV modeling systems aim to reconcile two competing demands:

  • Label freshness: Leverage new observations as soon as they appear to avoid stale feature distributions and concept drift.
  • Label completeness (unbiasedness): Use only fully mature labels for unbiased estimation, or apply explicit corrections for censoring bias.

Several primary paradigms have been advanced:

| Approach | Key Principle | Representative Paper |
| --- | --- | --- |
| Multi-head / delay-bucket | Partition the delay axis into discrete windows, train separate sub-models per bucket, aggregate | (Gao et al., 2022) |
| Label splitting & thermometer targets | Split the label into fixed and/or overlapping time buckets, train auxiliary models for the remaining label | (Badanidiyuru et al., 2021) |
| Importance weighting (IS) | Correct for observed/censored sample bias via instance-weighted losses | (Chen et al., 2022, Gu et al., 2021) |
| Repurchase-aware dual-branch | Specialized experts for single vs. repurchase patterns, routed by a learned classifier, with dynamic label calibration | (Li et al., 28 Jan 2026) |

3.1 Multi-Head Online Learning (MHOL)

MHOL (Gao et al., 2022) slices the delay axis into $n$ non-overlapping windows $r_1, \ldots, r_n$ (e.g., $(0,1]$ days, $(1,2]$ days, etc.), covering roughly equal fractions of conversions. Each head $i$ is trained with labels $(y_i, v_i)$—conversion indicator and value within window $i$—only after the corresponding delay threshold $t_i$ has lapsed. The GMV estimate aggregates predictions across all heads:

$$\mathbb{E}[\mathrm{GMV} \mid x; \theta] = \sum_{h=1}^n \hat v_h(x; \theta)$$

Fresh updates flow quickly for early heads, preserving recency, while late heads incorporate rare, slow conversions. Empirical results show that MHOL nearly closes the gap to oracle performance on Criteo data, with the lowest VPC (value per click) mean squared error among tested methods. Recommendations include choosing 4–6 heads covering 10–25% of conversions each, buffering clicks per head until the head’s maturity threshold, and tuning window edges based on the empirical conversion delay CDF (Gao et al., 2022).
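The multi-head mechanics above can be sketched in a few lines. The window edges, maturity rule, and stand-in "heads" below are illustrative choices, not the paper's implementation; the only structural assumptions taken from the text are non-overlapping delay windows, per-head label release after the window's upper edge, and GMV prediction as the sum of per-head value estimates.

```python
import numpy as np

# Illustrative delay-window edges in days: (0,1], (1,2], (2,4], (4,7].
EDGES = [0.0, 1.0, 2.0, 4.0, 7.0]

def head_index(delay_days: float) -> int:
    """Map a conversion's delay to the head whose window contains it."""
    for i in range(len(EDGES) - 1):
        if EDGES[i] < delay_days <= EDGES[i + 1]:
            return i
    raise ValueError("delay outside attribution window")

def head_is_mature(head: int, click_age_days: float) -> bool:
    """A click's label for head i is released only after the window's upper edge."""
    return click_age_days >= EDGES[head + 1]

def predict_gmv(x: np.ndarray, heads: list) -> float:
    """Aggregate GMV estimate: sum of per-head value predictions v̂_h(x)."""
    return float(sum(h(x) for h in heads))

# Stand-in heads: any callables mapping features to a per-window value estimate.
heads = [lambda x, w=w: w * float(x.sum()) for w in (0.5, 0.3, 0.15, 0.05)]
x = np.ones(4)
print(predict_gmv(x, heads))   # 4 * (0.5 + 0.3 + 0.15 + 0.05) = 4.0
```

In a real system each head would be a separately trained sub-model fed from its own buffer of matured clicks, as the recommendations above describe.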

3.2 Delay Buckets and Thermometer Encoding

“Handling many conversions per click” (Badanidiyuru et al., 2021) develops a delay-bucket decomposition: for click $p$, bucket $i$ receives label $Y_{p,i} = \sum_{j: d_i \le D_{p,j} < d_{i+1}} W_{p,j}$ so that $y_p = \sum_{i=0}^n Y_{p,i}$. Each $f_i$ predicts $Y_{p,i}$ using only fully mature (uncensored) data, yielding unbiased learning. Overlapping thermometer targets $T_{p,i} = \sum_{j: D_{p,j} \ge d_i} W_{p,j}$ further densify the learning signal for late buckets and enable $O(1)$ inference latency, as only one model is evaluated per click. Auxiliary features such as “label so far” $L_{p,i}$ reduce variance and support adaptation to drift. On practical datasets, this approach achieves −8.6% relative Poisson log-loss (PLL) vs. the mature-label baseline, with bias held under 1% (Badanidiyuru et al., 2021).
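The two target constructions can be checked on a toy purchase sequence. This is a minimal sketch of the label definitions only (the bucket edges and purchase values are made up); the trained models $f_i$ are out of scope.

```python
def bucket_labels(purchases, edges):
    """Y_{p,i}: value arriving with delay in [d_i, d_{i+1}); purchases = [(delay, value)]."""
    Y = [0.0] * (len(edges) - 1)
    for d, w in purchases:
        for i in range(len(edges) - 1):
            if edges[i] <= d < edges[i + 1]:
                Y[i] += w
    return Y

def thermometer_targets(purchases, edges):
    """T_{p,i}: value arriving with delay >= d_i (overlapping; densifies late buckets)."""
    return [sum(w for d, w in purchases if d >= d_i) for d_i in edges[:-1]]

edges = [0.0, 1.0, 2.0, 7.0]                # toy delay buckets in days
purchases = [(0.2, 10.0), (1.5, 5.0), (3.0, 20.0)]
Y = bucket_labels(purchases, edges)         # [10.0, 5.0, 20.0]
T = thermometer_targets(purchases, edges)   # [35.0, 25.0, 20.0]
assert sum(Y) == 35.0                       # buckets partition the total label y_p
```

Note how the thermometer targets are nested (each $T_{p,i}$ contains all later buckets), which is what gives late buckets a denser training signal than the disjoint $Y_{p,i}$.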

3.3 Importance Sampling and Unbiased Correction

The importance weighting (IS) methodology, exemplified in DEFUSE (Chen et al., 2022) and DEFER (Gu et al., 2021), structures the training stream as a mixture of immediate positives (IP), fake negatives (FN), real negatives (RN), and delayed positives (DP). DEFUSE, in particular, learns a separate classifier for distinguishing RN vs. FN among observed negatives and applies refined instance weights according to the empirical probability of each type (e.g., $z(x)$, $f_{dp}(x)$). For GMV, the observed value $V(x)$ multiplies the loss terms for positive labels:

$$L^{\text{unb}}(\theta) = \sum_{(x,v)\sim q} \Big\{ v\,[1+\mathbb{I}_{IP}(x)\,f_{dp}(x)]\,V(x)\,\ell(1, f_\theta(x)) + (1-v)\,[z(x)\,f_{dp}(x)\,V(x)\,\ell(1, f_\theta(x)) + \cdots] \Big\}$$

The bi-distribution variant (Bi-DEFUSE) separately trains in-window and out-window subtasks, yielding lower variance and faster convergence for short attribution windows ($w_a \le 7$ days). Experimental results on Criteo and Taobao show 2–6% improvements in relative RI-AUC over prior methods, with the model approaching oracle-level bias for new advertisers (Chen et al., 2022). DEFER extends the IS principle by ingesting explicit real negatives after the full attribution window, further reducing conditional distribution drift and bias (Gu et al., 2021).
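A per-sample sketch of such an instance-weighted loss is shown below. This is a schematic reading, not DEFUSE's exact objective: the paper's loss elides its real-negative term ("⋯"), so the `(1 - z) * bce(0, ...)` term here is an illustrative completion, and the scalar signature is invented for clarity.

```python
import math

def bce(y: float, p: float) -> float:
    """Standard binary cross-entropy with probability clipping."""
    p = min(max(p, 1e-7), 1 - 1e-7)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def weighted_loss(v, p_theta, value, f_dp, z, is_ip):
    """One-sample sketch of an importance-weighted delayed-feedback loss.
    Positives get weight (1 + 1_IP * f_dp) * V(x); observed negatives are split
    into a fake-negative part (weight z * f_dp * V(x), positive target) and an
    illustrative real-negative part (the paper elides this term)."""
    if v == 1:
        w = 1.0 + (f_dp if is_ip else 0.0)
        return w * value * bce(1, p_theta)
    # Observed negative: with probability z it is a fake negative that will convert.
    return z * f_dp * value * bce(1, p_theta) + (1 - z) * bce(0, p_theta)

# An immediate positive with GMV 10, f_dp = 0.2: weight 1.2 * 10 on the log-loss.
print(weighted_loss(1, 0.9, 10.0, f_dp=0.2, z=0.0, is_ip=True))
```

The key structural point is that observed negatives contribute to *both* targets, weighted by the inferred probability of being a fake negative.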

3.4 Repurchase-Aware Dual-branch Modeling

TRACE analysis (Li et al., 28 Jan 2026) motivates architectures that specialize for repurchase events. The READER model employs a shared-bottom dual-tower structure, with an MLP-based router predicting the probability of repurchase. The router soft-assigns each sample between a single-purchase expert $f_S$ and a repurchase expert $f_R$, using thresholds $(\tau_1, \tau_2)$ and hybrid-zone weighting for uncertainty. To mitigate bias from partial-label training, READER learns a log-gap calibrator that predicts the shift between observed partial GMV and final GMV, enabling pseudo-label correction. At window close, ground-truth alignment (GRA) and partial label unlearning (PLU) remove any inflation due to dynamic calibration. Empirical results on TRACE indicate that READER achieves AUC = 0.8235, ACC = 0.2612 (+2.19% over best baseline), and ALPR = 0.7523 (−6.88%) on the final 26-day test period (Li et al., 28 Jan 2026).
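The routing logic can be sketched as follows. The threshold values and the linear blending rule in the hybrid zone are an illustrative reading of the description above, not READER's published implementation; $f_S$ and $f_R$ are stood in by precomputed scalar predictions.

```python
def route(p_repurchase: float, pred_single: float, pred_repurchase: float,
          tau1: float = 0.3, tau2: float = 0.7) -> float:
    """Soft-assign a sample between the single-purchase and repurchase experts.
    Below tau1 / above tau2 the router commits to one expert; inside the hybrid
    zone the two predictions are blended by the router probability.
    (Thresholds and blend rule are illustrative assumptions.)"""
    if p_repurchase < tau1:
        return pred_single
    if p_repurchase > tau2:
        return pred_repurchase
    return (1 - p_repurchase) * pred_single + p_repurchase * pred_repurchase

print(route(0.1, 50.0, 200.0))   # confident single-purchase: 50.0
print(route(0.5, 50.0, 200.0))   # hybrid zone: 0.5*50 + 0.5*200 = 125.0
print(route(0.9, 50.0, 200.0))   # confident repurchase: 200.0
```

The soft hybrid zone is what the ablations in Section 5 find "strictly better than hard splits": uncertain samples hedge between the two heavy-tailed regimes instead of committing.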

4. Training Regimes and Online Learning Protocols

Delayed feedback GMV modeling overwhelmingly leverages online learning protocols:

  • Streaming updates: Models are refreshed continuously as new purchases—partial labels—arrive. Streaming learning captures fast-evolving patterns, as shown by the AUC gap (0.8055 vs. 0.8165) between daily offline and streaming online learning in TRACE (Li et al., 28 Jan 2026).
  • Delayed/Windowed Label Release: For methods using bucketed or head-based partitioning (Gao et al., 2022, Badanidiyuru et al., 2021), labels are withheld from the corresponding model until the bucket’s time window (i.e., delay maturity) has elapsed, obviating imputation or zero-filling for censored observations.
  • Label calibration and debiasing: Especially for continuous GMV, specialized calibrator networks are trained to estimate time-varying label inflation/deflation, with counterfactual unlearning at window closure (Li et al., 28 Jan 2026).

All leading methods utilize a mixture of standard regression or log-loss objectives, instance weighting (for IS methods), and frequent evaluation against cumulative online metrics such as AUC, MAE, and log relative error.
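The delayed/windowed label-release protocol above amounts to a maturity-ordered buffer. The sketch below is a toy version of that bookkeeping (class and method names are invented); real systems shard this state across workers at much larger scale.

```python
import heapq

class LabelReleaseBuffer:
    """Buffer censored clicks and release each one for training only once its
    delay window has matured (toy streaming protocol; names are illustrative)."""
    def __init__(self):
        self._heap = []   # min-heap of (maturity_time, click_id)

    def add(self, click_id: str, t_click: float, maturity_delay: float) -> None:
        heapq.heappush(self._heap, (t_click + maturity_delay, click_id))

    def release(self, t_now: float) -> list:
        """Pop every click whose label is fully mature at wall-clock time t_now."""
        ready = []
        while self._heap and self._heap[0][0] <= t_now:
            ready.append(heapq.heappop(self._heap)[1])
        return ready

buf = LabelReleaseBuffer()
buf.add("c1", t_click=0.0, maturity_delay=1.0)
buf.add("c2", t_click=0.5, maturity_delay=2.0)
print(buf.release(t_now=1.0))   # ['c1'] — c2 matures at t = 2.5
```

Per-head or per-bucket methods keep one such buffer per delay window, so early windows feed the model quickly while late windows trade freshness for label completeness.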

5. Empirical Findings and Comparative Results

Experimental results across public (Criteo, Taobao) and proprietary datasets establish several consistent patterns:

  • Label correction approaches consistently outperform naive partial-label training. For example, on Criteo-30d, Bi-DEFUSE delivers up to +6% RI-AUC over baselines (Chen et al., 2022).
  • Multi-head and bucketed models (MHOL, thermometer targets) offer near-oracle performance when the number of buckets is matched to the empirical delay distribution (e.g., 4–6 buckets, each with 10–25% of conversions) (Gao et al., 2022, Badanidiyuru et al., 2021).
  • Repurchase-specific routing and calibration yield further gains: READER's ablations confirm that routing, gap calibration, and debiasing each improve AUC and regression accuracy, with soft/hybrid routing strictly better than hard splits (Li et al., 28 Jan 2026).
  • Bias is strictly controlled in the best models: thermometer encoding and importance-corrected dual-branch networks both maintain bias within ±1% on held-out, cold-start, and long-delay slices (Chen et al., 2022, Badanidiyuru et al., 2021).

6. Practical Considerations and Recommendations

Recommended system design guidelines include:

  1. Delay window partitioning: Use historical delay CDFs to define delay buckets or heads so each contains 10–25% of conversions. Too few heads increase bias; too many increase training and maintenance cost (Gao et al., 2022, Badanidiyuru et al., 2021).
  2. Auxiliary features: Include real-time “GMV so far” as an input; thermometer encoding leverages this for effective variance reduction and drift adaptation (Badanidiyuru et al., 2021).
  3. Importance weighting: Necessary whenever the observed label stream deviates from the true distribution due to censoring or duplication. Ensure weights are based on empirical or model-inferred conversion/delay probabilities (Chen et al., 2022, Gu et al., 2021).
  4. Streaming, not batch, updates: Streaming learning is strongly favored, as label distributions are non-stationary over time (Li et al., 28 Jan 2026).
  5. Calibration and debiasing at window closure: Ground-truth alignment and partial label unlearning should be performed when the attribution window ends to prevent cumulative pseudo-label errors (Li et al., 28 Jan 2026).
  6. Separate modeling of repurchase patterns: Employ expert/tower structures with learned routing if the repurchase rate and purchase value distribution exhibit heterogeneity (Li et al., 28 Jan 2026).
  7. Metric selection: Use both ranking (AUC) and regression (ACC, ALPR, PLL) metrics for comprehensive model evaluation.
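Guideline 1 can be sketched directly: pick bucket edges as empirical quantiles of the conversion-delay distribution so that each bucket holds an equal share of conversions. The synthetic delay distribution below is illustrative, not from any of the cited datasets.

```python
import numpy as np

def bucket_edges_from_delays(delays, n_buckets: int = 5) -> np.ndarray:
    """Delay-bucket edges at empirical quantiles, so each bucket holds an
    equal fraction (1/n_buckets) of observed conversions."""
    qs = np.linspace(0.0, 1.0, n_buckets + 1)
    return np.quantile(np.asarray(delays), qs)

# Synthetic heavy-tailed delays (hours): most conversions fast, a long slow tail.
rng = np.random.default_rng(0)
delays = rng.exponential(scale=12.0, size=10_000)
edges = bucket_edges_from_delays(delays, n_buckets=5)
counts, _ = np.histogram(delays, bins=edges)
print(counts)   # ~2000 conversions per bucket by construction
```

With 5 buckets each holds 20% of conversions, which sits inside the 10–25% range the guideline recommends; tail buckets end up much wider in time than head buckets, reflecting the heavy-tailed delay CDF.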

7. Open Challenges and Research Directions

  • Multi-purchase temporal structure: As seen in TRACE, repurchase patterns introduce multi-modal, heavy-tailed behaviors that challenge linear bucketization or single-head models.
  • Drift adaptation: Intra-day seasonality and campaign drift necessitate adaptive bucket boundaries, dynamic feature augmentation, or continual learning variants (Li et al., 28 Jan 2026, Badanidiyuru et al., 2021).
  • Scalability to high-QPS, real-time bidding: Memory efficiency for per-bucket/head buffers and per-sample calibration must scale to billions of live events (Gao et al., 2022).
  • Unified treatment of platform-induced and exogenous delays: External (non-platform) delays, such as fulfillment lag, remain challenging to model and correct.
  • Generalization to other continuous, multi-event business metrics: The principles developed for GMV modeling—windowed bucketization, thermometer encoding, repurchase-aware routing—may transfer, but require careful empirical validation.

Delayed feedback modeling for GMV prediction is an active field, continually advancing toward systems that are unbiased, low-latency, robust to drift, and capable of fine-grained statistical control under complex, censored data streams (Gao et al., 2022, Badanidiyuru et al., 2021, Chen et al., 2022, Gu et al., 2021, Li et al., 28 Jan 2026).
