
Weighted Contrastive Adaptation (WECA)

Updated 15 December 2025
  • Weighted Contrastive Adaptation (WECA) is an anomaly-aware training algorithm for time-series forecasting that balances invariance to benign perturbations with preservation of anomaly-specific features.
  • It employs a weighted InfoNCE contrastive loss where similarity weights, derived from input perturbation magnitudes, modulate the alignment between normal and anomaly-augmented samples.
  • Applications include enhanced forecasting robustness for ATM cash logistics, demonstrating significant SMAPE improvements on anomalous data while maintaining near-optimal performance on normal conditions.

Weighted Contrastive Adaptation (WECA) is a training algorithm for multivariate time-series forecasting that builds anomaly awareness directly into representation learning, targeting improved performance under distribution shift without degrading accuracy on normal data. WECA introduces a weighted contrastive objective that interpolates between invariance to benign augmentations and preservation of anomaly-specific information. Its principal application, demonstrated on nationwide ATM cash logistics, is to improve forecaster robustness under realistic, domain-informed anomaly events (Ekstrand et al., 8 Dec 2025).

1. Objective Formulation and Loss Derivation

WECA extends instance-wise contrastive learning (InfoNCE) to representations $\{z_{i,t}\}$ of normal windows and $\{\tilde z_{i,t}\}$ of anomaly-augmented windows, both produced by an encoder $g_\phi$. For each window $i$ and time $t$, a similarity weight $w^{(i,t)} \in [0,1]$ scales the strength of alignment in the loss. The core weighted InfoNCE term is

$$\ell_{\mathrm{WECA}}^{(i,t)} = -w^{(i,t)}\,\log \frac{\exp\!\bigl(\langle z_{i,t},\tilde z_{i,t}\rangle/\tau\bigr)}{\sum_{j=1}^{B}\Bigl[\exp\!\bigl(\langle z_{i,t},\tilde z_{j,t}\rangle/\tau\bigr) + \mathbb{1}_{j\neq i}\,\exp\!\bigl(\langle z_{i,t},z_{j,t}\rangle/\tau\bigr)\Bigr]},$$

where $\langle\cdot,\cdot\rangle$ denotes the dot product. The full batch loss combines the contrastive term and a mean absolute error (MAE) forecast loss with balance coefficient $\lambda$:

$$\mathcal{L} = \frac{1}{B}\sum_{i=1}^{B}\mathcal{L}_{\mathrm{forecast}}^{(i)} + \lambda\,\frac{1}{BT'}\sum_{i=1}^{B}\sum_{t=1}^{T'}\ell_{\mathrm{WECA}}^{(i,t)}, \qquad \mathcal{L}_{\mathrm{forecast}}^{(i)} = \frac{1}{H}\sum_{h=1}^{H}\bigl\lVert y_{i,h}-\hat y_{i,h}\bigr\rVert_1,$$

with decoder $h_\psi$, forecast targets $y_i$, and per-timestep outputs $\hat y_{i,h} = h_\psi\bigl(g_\phi(\mathbf{x}_i)\bigr)_h$.
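The weighted InfoNCE term above can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the formula, not the authors' code; array shapes and the function name `weca_loss` are assumptions.

```python
import numpy as np

def weca_loss(z, z_tilde, w, tau=0.1):
    """Weighted InfoNCE over one batch at a single time step.

    z, z_tilde : (B, D) normal / anomaly-augmented embeddings
    w          : (B,) similarity weights in [0, 1]
    tau        : InfoNCE temperature
    """
    B = z.shape[0]
    sim_aug = z @ z_tilde.T / tau     # <z_i, z~_j> / tau, shape (B, B)
    sim_norm = z @ z.T / tau          # <z_i, z_j> / tau,  shape (B, B)
    pos = np.diag(sim_aug)            # positive logits <z_i, z~_i> / tau
    # denominator: all augmented pairs plus normal negatives with j != i
    mask = 1.0 - np.eye(B)
    denom = np.exp(sim_aug).sum(axis=1) + (np.exp(sim_norm) * mask).sum(axis=1)
    # -w * log(exp(pos) / denom), averaged over the batch
    return (-w * (pos - np.log(denom))).mean()
```

Note that setting $w^{(i,t)}=0$ removes a sample's contrastive contribution entirely, which is the mechanism WECA uses to avoid forcing invariance on heavily anomalous pairs.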

2. Contrastive Pairing and Anomaly Augmentation

The contrastive setup defines, for each sample $i$ at position $t$, a positive pair $(z_{i,t},\tilde z_{i,t})$ and negative pairs of two kinds: $(z_{i,t},\tilde z_{j,t})$ for all $j\neq i$ (augmented negatives) and $(z_{i,t},z_{j,t})$ for $j\neq i$ (other normal negatives). Anomaly-augmented samples are synthesized at the input level: for each $\mathbf{x}_i$, the last window positions (tail) are perturbed using an anomaly function $a(n) = A\,n\,e^{-Bn^C}$, where $(A,B,C)$ are stochastically sampled to mimic historically observed anomaly statistics (e.g. $A\sim\mathcal{N}(74120,\,20000^2)$, $B=0.39$, $C\sim\mathcal{N}(0.806,\,0.3^2)$). The generated anomaly $a(n)$ is injected into the tail of $\mathbf{x}_i$ and propagated into the forecast horizon, yielding $\tilde{\mathbf{x}}_i$.
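A minimal sketch of the augmentation step, using the stated parameter distributions. The indexing of $n$ (starting at 1) and the additive tail injection are assumptions consistent with the pseudocode in section 4; exact propagation into the horizon may differ in the paper.

```python
import numpy as np

def sample_anomaly(length, rng):
    """Draw (A, B, C) and evaluate a(n) = A * n * exp(-B * n^C)
    over n = 1..length (start at 1 to avoid the degenerate n = 0 term)."""
    A = rng.normal(74120.0, 20000.0)
    B = 0.39
    C = rng.normal(0.806, 0.3)
    n = np.arange(1, length + 1, dtype=float)
    return A * n * np.exp(-B * n ** C)

def augment_tail(x, anomaly):
    """Inject the anomaly additively into the last len(anomaly) steps of x."""
    x_tilde = x.copy()
    x_tilde[-len(anomaly):] += anomaly
    return x_tilde
```

The exponential decay in $a(n)$ produces a sharp spike that relaxes back toward the baseline, matching the spike-and-recovery shape of real cash-demand disruptions.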

3. Weight Function Design and Principle

The weight $w^{(i,t)}$ is a continuous function of the input-level perturbation magnitude, typically of the form $w^{(i,t)} = \exp\bigl(-\alpha\,\lVert \mathbf{x}_i-\tilde{\mathbf{x}}_i\rVert_2\bigr)$, with $\alpha$ chosen so that domain-plausible (benign) variations yield $w\approx 0.9$–$1.0$ (strong invariance imposed), while major anomaly-like deviations yield $w\approx 0.1$–$0.3$ (weak or no invariance). This soft weighting causes the encoder to enforce invariance only for small (non-anomalous) perturbations, while explicitly retaining anomaly-specific features otherwise. Consequently, WECA can interpolate between pure contrastive learning (full invariance) and no contrastive adaptation, preserving anomaly detectability while improving normal operational robustness.
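The weight function itself is one line; a sketch assuming the exponential form above (the helper name is illustrative):

```python
import numpy as np

def similarity_weight(x, x_tilde, alpha):
    """w = exp(-alpha * ||x - x~||_2): close to 1 for benign
    perturbations, close to 0 for anomaly-scale deviations."""
    return float(np.exp(-alpha * np.linalg.norm(x - x_tilde)))
```

Because the weight decays exponentially in the perturbation norm, no hard threshold between "benign" and "anomalous" is needed; the transition is smooth and controlled entirely by $\alpha$.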

4. Training Protocol and Network Architecture

The WECA training loop proceeds as follows:

for each training step:
    1. Sample batch {(x_i, y_i)}_{i=1}^B
    2. For each i:
         - Sample anomaly parameters (A, B, C)
         - Generate anomaly a_i(n) = A * n * exp(-B * n^C)
         - Form augmented input x~_i = x_i + tail(a_i)
    3. Compute Z = g_phi({x_i}), Z~ = g_phi({x~_i})
    4. Compute ŷ_i = h_psi(z_i)  # normal branch only feeds the decoder
    5. Compute forecast loss L_forecast = (1/B) Σ_i MAE(y_i, ŷ_i)
    6. Compute weights w^(i,t) = exp(-α * ||x_i - x~_i||_2)
    7. Compute L_WECA = (1/(B·T')) Σ_{i=1}^B Σ_{t=1}^{T'} ℓ_WECA^{(i,t)}
    8. Total loss: L = L_forecast + λ·L_WECA
    9. Backpropagate and update parameters via Adam (lr = 1e-3)
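The loss combination in steps 5–8 reduces to a simple weighted sum; a minimal NumPy sketch, assuming the per-position contrastive terms have already been computed (names are illustrative):

```python
import numpy as np

def total_loss(y, y_hat, weca_terms, lam=1.0):
    """L = forecast MAE + lambda * mean weighted contrastive term.

    y, y_hat   : (B, H) forecast targets and predictions
    weca_terms : (B, T') precomputed per-position losses ℓ_WECA^(i,t)
    lam        : balance coefficient lambda
    """
    forecast = np.mean(np.abs(y - y_hat))  # (1/B) Σ_i (1/H) Σ_h |y - ŷ|
    return forecast + lam * np.mean(weca_terms)
```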

The backbone architecture is TimesNet, which attains the best symmetric mean absolute percentage error (SMAPE) on normal data among the evaluated backbones. The contrastive head projects encoded features into a $D$-dimensional space for dot-product similarity.

5. Hyperparameter Selection and Weighting Strategy

Selected hyperparameters:

  • Learning rate: $10^{-3}$ (Adam optimizer)
  • Batch size: 128
  • Contrastive weight $\lambda$: 1 (tuned via validation)
  • InfoNCE temperature $\tau$: typically 0.1, or absorbed into the dot-product scaling
  • Weighting parameter $\alpha$: tuned so that small (domain-plausible) perturbations produce $w\approx 0.9$–$1.0$ and large-scale anomalies $w\approx 0.1$–$0.3$
  • Early stopping: monitored on validation MAE
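Since $w = \exp(-\alpha d)$, the target weight bands pin down $\alpha$ in closed form: setting $w(d_{\text{benign}}) = 0.9$ gives $\alpha = -\ln(0.9)/d_{\text{benign}}$. A sketch of this calibration (the helper name and the reference-magnitude approach are assumptions, not the paper's stated procedure):

```python
import numpy as np

def calibrate_alpha(d_benign, w_target=0.9):
    """Choose alpha so a typical benign perturbation of L2 magnitude
    d_benign maps to weight w_target (e.g. 0.9)."""
    return -np.log(w_target) / d_benign
```

With this calibration, an anomaly roughly 20 times the benign magnitude receives $w = 0.9^{20} \approx 0.12$, inside the stated 0.1–0.3 band.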

Editor's term: "Dynamic similarity weighting" refers to the weighting mechanism for anchoring contrastive invariance according to anomaly severity.

6. Empirical Evaluation and Performance

The primary empirical benchmark uses approximately 1,300 ATMs' daily withdrawal data over two years, with a rolling-origin 70/10/20 training/validation/test split. Anomaly-augmented data is generated using the described injection procedure.

Evaluation metric: SMAPE (%) on a 14-day forecast horizon.
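For reference, a common SMAPE formulation is sketched below; the paper's exact variant (denominator scaling, zero handling) is not specified here, so this is an assumption.

```python
import numpy as np

def smape(y, y_hat, eps=1e-8):
    """Symmetric MAPE in percent: mean of 2|y - ŷ| / (|y| + |ŷ|).
    eps guards against division by zero when both values vanish."""
    num = 2.0 * np.abs(y - y_hat)
    den = np.abs(y) + np.abs(y_hat) + eps
    return 100.0 * np.mean(num / den)
```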

Performance summary:

| Method | Normal Data (ND) SMAPE ± std | Δ vs NT | Anomaly Data (AD) SMAPE ± std | Δ vs NT |
|--------|------------------------------|---------|-------------------------------|---------|
| NT     | 28.73                        |         | 37.91                         |         |
| FT     | 31.50 ± 0.87                 | +2.77   | 30.69 ± 2.03                  | −7.22   |
| CL-IL  | 28.62 ± 0.97                 | −0.11   | 33.09 ± 0.96                  | −4.82   |
| WECA   | 28.70 ± 1.00                 | −0.03   | 31.78 ± 1.93                  | −6.13   |

WECA achieves a 6.13 percentage point reduction in SMAPE on anomaly-affected test data compared to the normally trained baseline (NT), while only marginally affecting normal-data performance (0.03 pp difference). Instance-level contrastive learning (CL-IL) offers some anomaly robustness but with weaker margins; fine-tuning on anomaly-only data (FT) yields the largest anomaly gain but with a significant loss on normal data, consistent with catastrophic forgetting (Ekstrand et al., 8 Dec 2025).

7. Implementation and Applicability Considerations

  • The anomaly injection process is fully reproducible with fixed random seeds and parameter distributions.
  • WECA approximately doubles encoder passes per batch (normal + augmented), but the total computational overhead remains under 20% of standard GPU training time.
  • The similarity weight $w^{(i,t)}$ can be adapted to rely on alternative distance metrics or learned heuristics.
  • WECA is compatible with any deep forecaster architecture that exposes intermediate latent representations; no property is exclusive to TimesNet.
  • Training is most stable when the forecasting backbone is pre-trained on normal data and then fine-tuned with the WECA joint objective.
  • The method requires access to domain knowledge or distributions for realistic anomaly magnitude modeling, especially for applications like cash logistics where external events influence future behavior.

By calibrating the trade-off between invariance and anomaly awareness, WECA provides reliable forecasting resilience across severe real-world distribution shifts without compromising peacetime operational accuracy (Ekstrand et al., 8 Dec 2025).
