
Weighted Contrastive Adaptation (WECA)

Updated 15 December 2025
  • Weighted Contrastive Adaptation (WECA) is an anomaly-aware training algorithm for time-series forecasting that balances invariance to benign perturbations with preservation of anomaly-specific features.
  • It employs a weighted InfoNCE contrastive loss where similarity weights, derived from input perturbation magnitudes, modulate the alignment between normal and anomaly-augmented samples.
  • Applications include enhanced forecasting robustness for ATM cash logistics, demonstrating significant SMAPE improvements on anomalous data while maintaining near-optimal performance on normal conditions.

Weighted Contrastive Adaptation (WECA) is a training algorithm for multivariate time-series forecasting that builds anomaly awareness directly into representation learning, targeting improved performance under distribution shift without degrading accuracy on normal data. WECA introduces a weighted contrastive objective that interpolates between invariance to benign augmentations and preservation of anomaly-specific information. Its principal application, demonstrated on nationwide ATM cash logistics, is to improve forecaster robustness under realistic, domain-informed anomaly events (Ekstrand et al., 8 Dec 2025).

1. Objective Formulation and Loss Derivation

WECA extends instance-wise contrastive learning (InfoNCE) to representations $\{z_{i,t}\}$ of normal windows and $\{\tilde z_{i,t}\}$ of anomaly-augmented windows, both produced by an encoder $g_\phi$. For each window $i$ and time $t$, a similarity weight $w^{(i,t)} \in [0,1]$ scales the strength of alignment in the loss. The core weighted InfoNCE term is

$$\ell_{\mathrm{WECA}}^{(i,t)} = -w^{(i,t)}\,\log \frac{\exp\!\bigl(\langle z_{i,t},\tilde z_{i,t}\rangle/\tau\bigr)}{\sum_{j=1}^{B}\Bigl[\exp\!\bigl(\langle z_{i,t},\tilde z_{j,t}\rangle/\tau\bigr) + \mathbb{1}_{j\neq i}\,\exp\!\bigl(\langle z_{i,t},z_{j,t}\rangle/\tau\bigr)\Bigr]},$$

where $\langle\cdot,\cdot\rangle$ denotes the dot product. The full batch loss combines the contrastive term and a mean absolute error (MAE) forecast loss with balance coefficient $\lambda$:

$$\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{\mathrm{forecast}}^{(i)} + \lambda\,\frac{1}{BT'}\sum_{i=1}^{B}\sum_{t=1}^{T'}\ell_{\mathrm{WECA}}^{(i,t)}, \qquad \mathcal{L}_{\mathrm{forecast}}^{(i)} = \frac{1}{H}\sum_{h=1}^{H}\bigl\lVert y_{i,h}-\hat y_{i,h}\bigr\rVert_1,$$

with decoder $h_\psi$, forecast targets $y_i$, and per-timestep outputs $\hat y_{i,h} = h_\psi\bigl(g_\phi(\mathbf{x}_i)\bigr)_h$.
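The weighted InfoNCE term above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the formula, not the authors' code; array shapes and the function name `weca_loss` are assumptions.

```python
import numpy as np

def weca_loss(z, z_tilde, w, tau=0.1):
    """Weighted InfoNCE over one batch at a single time step.

    z, z_tilde : (B, D) normal / anomaly-augmented embeddings
    w          : (B,) similarity weights in [0, 1]
    tau        : InfoNCE temperature
    """
    B = z.shape[0]
    sim_aug = z @ z_tilde.T / tau     # <z_i, z~_j> / tau, shape (B, B)
    sim_norm = z @ z.T / tau          # <z_i, z_j> / tau,  shape (B, B)
    pos = np.diag(sim_aug)            # positive logits <z_i, z~_i> / tau
    # denominator: all augmented pairs plus normal negatives with j != i
    mask = 1.0 - np.eye(B)
    denom = np.exp(sim_aug).sum(axis=1) + (np.exp(sim_norm) * mask).sum(axis=1)
    # -w * log(exp(pos) / denom), averaged over the batch
    return (-w * (pos - np.log(denom))).mean()
```

Note that setting $w^{(i,t)}=0$ removes a sample's contrastive contribution entirely, which is the mechanism WECA uses to avoid forcing invariance on heavily anomalous pairs.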

2. Contrastive Pairing and Anomaly Augmentation

The contrastive setup defines, for each sample $i$ at position $t$, a positive pair $(z_{i,t},\tilde z_{i,t})$ and negative pairs of two kinds: $(z_{i,t},\tilde z_{j,t})$ for all $j\neq i$ (augmented negatives) and $(z_{i,t},z_{j,t})$ for $j\neq i$ (other normal negatives). Anomaly-augmented samples are synthesized at the input level: for each $\mathbf{x}_i$, the last window positions (tail) are perturbed using an anomaly function $a(n) = A\,n\,e^{-Bn^C}$, where $(A,B,C)$ are stochastically sampled to mimic historically observed anomaly statistics (e.g. $A\sim\mathcal{N}(74120,\,20000^2)$, $B=0.39$, $C\sim\mathcal{N}(0.806,\,0.3^2)$). The generated anomaly $a(n)$ is injected into the tail of $\mathbf{x}_i$ and propagated into the forecast horizon, yielding $\tilde{\mathbf{x}}_i$.
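A minimal sketch of the augmentation step, using the stated parameter distributions. The indexing of $n$ (starting at 1) and the additive tail injection are assumptions consistent with the pseudocode in section 4; exact propagation into the horizon may differ in the paper.

```python
import numpy as np

def sample_anomaly(length, rng):
    """Draw (A, B, C) and evaluate a(n) = A * n * exp(-B * n^C)
    over n = 1..length (start at 1 to avoid the degenerate n = 0 term)."""
    A = rng.normal(74120.0, 20000.0)
    B = 0.39
    C = rng.normal(0.806, 0.3)
    n = np.arange(1, length + 1, dtype=float)
    return A * n * np.exp(-B * n ** C)

def augment_tail(x, anomaly):
    """Inject the anomaly additively into the last len(anomaly) steps of x."""
    x_tilde = x.copy()
    x_tilde[-len(anomaly):] += anomaly
    return x_tilde
```

The exponential decay in $a(n)$ produces a sharp spike that relaxes back toward the baseline, matching the spike-and-recovery shape of real cash-demand disruptions.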

3. Weight Function Design and Principle

The weight $w^{(i,t)}$ is a continuous function of the input-level perturbation magnitude, typically of the form $w^{(i,t)} = \exp\bigl(-\alpha\,\lVert \mathbf{x}_i-\tilde{\mathbf{x}}_i\rVert_2\bigr)$, with $\alpha$ chosen so that domain-plausible (benign) variations yield $w\approx 0.9$–$1.0$ (strong invariance imposed), while major anomaly-like deviations yield $w\approx 0.1$–$0.3$ (weak or no invariance). This soft weighting causes the encoder to enforce invariance only for small (non-anomalous) perturbations, while explicitly retaining anomaly-specific features otherwise. Consequently, WECA can interpolate between pure contrastive learning (full invariance) and no contrastive adaptation, preserving anomaly detectability while improving normal operational robustness.
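The weight function itself is one line; a sketch assuming the exponential form above (the helper name is illustrative):

```python
import numpy as np

def similarity_weight(x, x_tilde, alpha):
    """w = exp(-alpha * ||x - x~||_2): close to 1 for benign
    perturbations, close to 0 for anomaly-scale deviations."""
    return float(np.exp(-alpha * np.linalg.norm(x - x_tilde)))
```

Because the weight decays exponentially in the perturbation norm, no hard threshold between "benign" and "anomalous" is needed; the transition is smooth and controlled entirely by $\alpha$.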

4. Training Protocol and Network Architecture

The WECA training loop proceeds as follows:

for each training step:
    1. Sample batch {(x_i, y_i)}_{i=1}^B
    2. For each i:
         - Sample anomaly parameters (A, B, C)
         - Generate anomaly a_i(n) = A * n * exp(-B * n^C)
         - Form augmented input x~_i = x_i + tail(a_i)
    3. Compute Z = g_phi({x_i}), Z~ = g_phi({x~_i})
    4. Compute ŷ_i = h_psi(z_i)  # normal branch only feeds the decoder
    5. Compute forecast loss L_forecast = (1/B) Σ_i MAE(y_i, ŷ_i)
    6. Compute weights w^(i,t) = exp(-α * ||x_i - x~_i||_2)
    7. Compute L_WECA = (1/(B·T')) Σ_{i=1}^B Σ_{t=1}^{T'} ℓ_WECA^{(i,t)}
    8. Total loss: L = L_forecast + λ·L_WECA
    9. Backpropagate and update parameters via Adam (lr = 1e-3)
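The loss combination in steps 5–8 reduces to a simple weighted sum; a minimal NumPy sketch, assuming the per-position contrastive terms have already been computed (names are illustrative):

```python
import numpy as np

def total_loss(y, y_hat, weca_terms, lam=1.0):
    """L = forecast MAE + lambda * mean weighted contrastive term.

    y, y_hat   : (B, H) forecast targets and predictions
    weca_terms : (B, T') precomputed per-position losses ℓ_WECA^(i,t)
    lam        : balance coefficient lambda
    """
    forecast = np.mean(np.abs(y - y_hat))  # (1/B) Σ_i (1/H) Σ_h |y - ŷ|
    return forecast + lam * np.mean(weca_terms)
```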

The backbone architecture is TimesNet, which attains the best symmetric mean absolute percentage error (SMAPE) on normal data among the evaluated backbones. The contrastive head projects encoded features into a $D$-dimensional space for dot-product similarity.

5. Hyperparameter Selection and Weighting Strategy

Selected hyperparameters:

  • Learning rate: $10^{-3}$ (Adam optimizer)
  • Batch size: 128
  • Contrastive weight $\lambda$: 1 (tuned via validation)
  • InfoNCE temperature $\tau$: typically 0.1, or absorbed into the dot-product scaling
  • Weighting parameter $\alpha$: tuned so that small (domain-plausible) perturbations produce $w\approx 0.9$–$1.0$ and large-scale anomalies $w\approx 0.1$–$0.3$
  • Early stopping: monitored on validation MAE
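Since $w = \exp(-\alpha d)$, the target weight bands pin down $\alpha$ in closed form: setting $w(d_{\text{benign}}) = 0.9$ gives $\alpha = -\ln(0.9)/d_{\text{benign}}$. A sketch of this calibration (the helper name and the reference-magnitude approach are assumptions, not the paper's stated procedure):

```python
import numpy as np

def calibrate_alpha(d_benign, w_target=0.9):
    """Choose alpha so a typical benign perturbation of L2 magnitude
    d_benign maps to weight w_target (e.g. 0.9)."""
    return -np.log(w_target) / d_benign
```

With this calibration, an anomaly roughly 20 times the benign magnitude receives $w = 0.9^{20} \approx 0.12$, inside the stated 0.1–0.3 band.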

Editor's term: "Dynamic similarity weighting" refers to the weighting mechanism for anchoring contrastive invariance according to anomaly severity.

6. Empirical Evaluation and Performance

The primary empirical benchmark uses approximately 1,300 ATMs' daily withdrawal data over two years, with a rolling-origin 70/10/20 training/validation/test split. Anomaly-augmented data is generated using the described injection procedure.

Evaluation metric: SMAPE (%) on a 14-day forecast horizon.
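For reference, a common SMAPE formulation is sketched below; the paper's exact variant (denominator scaling, zero handling) is not specified here, so this is an assumption.

```python
import numpy as np

def smape(y, y_hat, eps=1e-8):
    """Symmetric MAPE in percent: mean of 2|y - ŷ| / (|y| + |ŷ|).
    eps guards against division by zero when both values vanish."""
    num = 2.0 * np.abs(y - y_hat)
    den = np.abs(y) + np.abs(y_hat) + eps
    return 100.0 * np.mean(num / den)
```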

Performance summary:

| Method | Normal Data (ND) SMAPE ± std | Δ vs NT | Anomaly Data (AD) SMAPE ± std | Δ vs NT |
|--------|------------------------------|---------|-------------------------------|---------|
| NT     | 28.73                        |         | 37.91                         |         |
| FT     | 31.50 ± 0.87                 | +2.77   | 30.69 ± 2.03                  | −7.22   |
| CL-IL  | 28.62 ± 0.97                 | −0.11   | 33.09 ± 0.96                  | −4.82   |
| WECA   | 28.70 ± 1.00                 | −0.03   | 31.78 ± 1.93                  | −6.13   |

WECA achieves a 6.13 percentage point reduction in SMAPE on anomaly-affected test data compared to the normally trained baseline (NT), while only marginally affecting normal-data performance (0.03 pp difference). Instance-level contrastive learning (CL-IL) offers some anomaly robustness but with weaker margins; fine-tuning on anomaly-only data (FT) yields the largest anomaly gain but with a significant loss on normal data, consistent with catastrophic forgetting (Ekstrand et al., 8 Dec 2025).

7. Implementation and Applicability Considerations

  • The anomaly injection process is fully reproducible with fixed random seeds and parameter distributions.
  • WECA approximately doubles encoder passes per batch (normal + augmented), but the total computational overhead remains under 20% of standard GPU training time.
  • The similarity weight $w^{(i,t)}$ can be adapted to rely on alternative distance metrics or learned heuristics.
  • WECA is compatible with any deep forecaster architecture that exposes intermediate latent representations; no property is exclusive to TimesNet.
  • Training is most stable when the forecasting backbone is pre-trained on normal data and then fine-tuned with the WECA joint objective.
  • The method requires access to domain knowledge or distributions for realistic anomaly magnitude modeling, especially for applications like cash logistics where external events influence future behavior.

By calibrating the trade-off between invariance and anomaly awareness, WECA provides reliable forecasting resilience across severe real-world distribution shifts without compromising peacetime operational accuracy (Ekstrand et al., 8 Dec 2025).
