Simple MixUp for Time-Series
- The paper introduces a vicinal risk minimization approach that generates virtual time-series examples by convexly mixing paired inputs and labels in raw or latent spaces.
- It demonstrates improved performance across classification, self-supervised learning, and transfer scenarios using standard architectures without requiring domain-specific tuning.
- Empirical results on diverse datasets validate enhanced robustness, scalability, and better model calibration by smoothing decision boundaries.
Embarrassingly simple MixUp for time-series is a vicinal risk-based data augmentation paradigm that synthesizes virtual examples by convexly mixing pairs of time-series and their corresponding labels in either raw or latent space. By leveraging element-wise, label-weighted interpolation, MixUp regularizes neural models for time-series classification, self-supervised representation learning, and transfer scenarios without requiring domain-specific transformations or hyperparameter tuning. This methodology has been further extended and systematically evaluated across numerous architectures, datasets, and training regimes, positioning MixUp and its latent variants as robust and scalable augmentation baselines in the time-series domain.
1. Formalization and Methodological Variants
The MixUp procedure operates on pairs of labeled time-series $(x_i, y_i)$ and $(x_j, y_j)$. A mixing coefficient $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ or $\lambda \sim U(0, 1)$ is sampled, then the synthetic input and label are constructed as

$$\tilde{x} = \lambda x_i + (1 - \lambda)\, x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda)\, y_j,$$

where $x_i, x_j \in \mathbb{R}^{T \times C}$ for multivariate sequences. Label mixing creates soft targets, encouraging smooth decision boundaries and mitigating overconfidence.
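A minimal NumPy sketch of this interpolation, pairing each batch element with a random partner (function name, `alpha` default, and batch layout are illustrative, not from the cited papers):

```python
import numpy as np

def mixup_batch(x, y, alpha=0.2, rng=None):
    """Mix a batch of time-series x (B, T, C) with one-hot labels y (B, n_classes).

    Hypothetical helper illustrating MixUp's element-wise, label-weighted
    interpolation; alpha is an assumed Beta-concentration default.
    """
    if rng is None:
        rng = np.random.default_rng()
    lam = rng.beta(alpha, alpha)              # mixing coefficient
    perm = rng.permutation(len(x))            # random pairing within the batch
    x_mix = lam * x + (1.0 - lam) * x[perm]   # element-wise input mixing
    y_mix = lam * y + (1.0 - lam) * y[perm]   # soft-label mixing
    return x_mix, y_mix
```

Because the same $\lambda$ weights both inputs and labels, the mixed targets remain valid probability distributions.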
MixUp++ introduces two modifications: (i) every minibatch includes both real and mixed examples; (ii) $K$ distinct random MixUp pairs are generated per batch using independent permutations and mix ratios. The supervised loss is

$$\mathcal{L} = \mathcal{L}_{\mathrm{CE}}\big(f(x), y\big) + \sum_{k=1}^{K} \mathcal{L}_{\mathrm{CE}}\big(f(\tilde{x}^{(k)}), \tilde{y}^{(k)}\big),$$

i.e., cross-entropy on the real batch plus cross-entropy on each of the $K$ mixed batches.
LatentMixUp++ replaces input mixing with hidden-feature mixing. For encoder $g$, projection head $h$, and input pair $(x_i, x_j)$,

$$\tilde{z} = \lambda\, g(x_i) + (1 - \lambda)\, g(x_j).$$

The prediction is obtained via $h(\tilde{z})$ and trained against the soft label $\tilde{y}$. This variant leverages the more linear manifold of model-internal latent representations (Aggarwal et al., 2023).
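A compact sketch of latent mixing, with the encoder and head passed in as plain callables (names and shapes are illustrative assumptions):

```python
import numpy as np

def latent_mixup(encoder, head, x_i, x_j, lam):
    """Latent-space MixUp sketch: mix encoder features, predict from the mixture.

    encoder and head stand in for g and h; x_i and x_j are two input series
    of identical shape, lam is the mixing coefficient.
    """
    z = lam * encoder(x_i) + (1.0 - lam) * encoder(x_j)  # hidden-space mix
    return head(z)                                        # prediction from mixed feature
```

For a linear encoder and head this reduces exactly to mixing the two individual predictions, which is why latent mixing behaves most predictably where representations are approximately linear.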
2. Integration into Contrastive and Supervised Frameworks
In supervised classification, MixUp or its variants are applied as preprocessing per minibatch and require only modifications to the input pipeline and loss function (cross-entropy on soft labels). Integration is architecture-agnostic, applicable to CNNs (e.g., InceptionTime, FCN), RNNs, Transformers, and ResNet-style 1D CNNs (Aggarwal et al., 2023, Yang et al., 2022, Guo et al., 2023).
In self-supervised or contrastive learning, MixUp is combined with triplet or contrastive losses by mixing views before they pass through the encoder (Wickstrøm et al., 2022). The MixUp-Normalized Temperature-Scaled Cross-Entropy (MNT-Xent) loss is used:

$$\ell = -\lambda \log \frac{\exp\!\big(s(\tilde{z}, z_i)/\tau\big)}{\sum_{k} \exp\!\big(s(\tilde{z}, z_k)/\tau\big)} \;-\; (1 - \lambda) \log \frac{\exp\!\big(s(\tilde{z}, z_j)/\tau\big)}{\sum_{k} \exp\!\big(s(\tilde{z}, z_k)/\tau\big)},$$

where $s(\cdot, \cdot)$ denotes cosine similarity, $\tau$ the temperature parameter, and $\tilde{z}, z_i, z_j$ the embeddings of the mixed view and its two source series. The positive weights are soft, directly encoding the mix ratio $\lambda$ and $1 - \lambda$ (Wickstrøm et al., 2022).
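A minimal sketch of one such soft-weighted contrastive term, restricted for brevity to the two source embeddings as candidates (in practice the denominator would also include in-batch negatives; the function name and `tau` default are assumptions):

```python
import numpy as np

def mnt_xent_term(z_mix, z_i, z_j, lam, tau=0.5):
    """Soft-positive contrastive term for one mixed anchor.

    z_mix: embedding of the mixed view; z_i, z_j: embeddings of the two
    source series. The loss weights the two positives by lam and 1 - lam.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    sims = np.array([cos(z_mix, z_i), cos(z_mix, z_j)]) / tau
    logp = sims - np.log(np.exp(sims).sum())          # log-softmax over candidates
    return -(lam * logp[0] + (1.0 - lam) * logp[1])   # soft positive weighting
```

When the mixed embedding sits close to $z_i$, the loss is small for $\lambda$ near 1 and large for $\lambda$ near 0, which is the supervision signal that encodes the mix ratio.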
3. Implementation Details and Hyperparameters
MixUp for time-series is "embarrassingly simple" in that no domain-specific expertise or sensitive tuning is required:
- Mixing coefficient: $\lambda$ drawn from $\mathrm{Beta}(\alpha, \alpha)$ or $U(0, 1)$; a single fixed setting works across datasets.
- Number of mixes per batch: $K = 1$ or $K = 2$; $K > 2$ can degrade performance.
- Networks: Standard architectures (FCN, InceptionTime, ResNet-18, Transformers) without architectural changes.
- Training: Adam optimizer, batch size 128–256, up to 1000 epochs.
- Minimal additional computational overhead, on the order of 1 ms per batch in practice (Guo et al., 2023, Wickstrøm et al., 2022, Aggarwal et al., 2023).
- Requires time-series to be resampled or padded to common length for element-wise mixing (Guo et al., 2023).
Semi-supervised extensions include pseudo-labeling: confident model predictions above threshold (e.g., 0.99) on unlabeled data are included, and MixUp is applied jointly over labeled and pseudo-labeled data (Aggarwal et al., 2023).
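The confidence filter described above can be sketched as follows (function name is hypothetical; the 0.99 threshold is the one stated in the text):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.99):
    """Keep unlabeled samples whose maximum predicted class probability
    exceeds the confidence threshold; return their indices and one-hot
    pseudo-labels for joint MixUp with the labeled set.
    """
    conf = probs.max(axis=1)                              # per-sample confidence
    keep = np.where(conf > threshold)[0]                  # confident samples only
    pseudo = np.eye(probs.shape[1])[probs[keep].argmax(axis=1)]
    return keep, pseudo
```

The retained pairs are then simply concatenated with the labeled minibatch before the MixUp step, so the mixing code itself is unchanged.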
4. Empirical Results
The effectiveness of MixUp and its variants has been extensively validated across multivariate, univariate, and physiological time-series datasets. The following summarizes key results from representative studies:
| Dataset Family | Baseline Acc (%) | MixUp Acc (%) | LatentMixUp++ (best) | Reference |
|---|---|---|---|---|
| UCI-HAR | 92.95 ± 0.83 | 92.63 ± 0.56 | 94.44 ± 0.72 | (Aggarwal et al., 2023) |
| Sleep-EDF | 80.57 ± 0.34 | 79.14 ± 0.96 | 81.12 ± 0.47 | (Aggarwal et al., 2023) |
| PTB-XL | 77.94 | 78.91 | — | (Guo et al., 2023) |
| PAMAP2 | 93.62 | 95.31 | — | (Guo et al., 2023) |
Additional findings:
- On the 128 UCR (univariate) and 30 UEA (multivariate) datasets, MixUp contrastive pretraining achieved kNN accuracies of $0.759$ (UCR) and $0.627$ (UEA), exceeding all baselines (Wickstrøm et al., 2022).
- Significant improvements (up to +10.5 percentage points) observed where interpolated series remain plausible (continuous signals) (Yang et al., 2022).
- LatentMixUp++ yields the largest benefits in low-label regimes: up to +15% relative improvement with only 1% labeled data; ablation shows diminishing gains as $K$ increases beyond 2, as synthetic samples begin to overwhelm the real data (Aggarwal et al., 2023).
- Mix-based methods outperformed single-sample augmentations (jitter, scale, warping), and CutMix sometimes provided further gains, especially in class-imbalanced settings (Guo et al., 2023).
5. Analysis of Efficacy and Domain Adaptation
Several properties underlie the robustness and transferability of MixUp-based approaches:
- No domain-specific assumptions or transformations are required; MixUp performs consistently across diverse sensor modalities and application domains by interpolating in raw or latent spaces (Guo et al., 2023).
- Vicinal Risk Minimization: Mixing expands the local neighborhood manifold, filling gaps between real samples and regularizing the learned function.
- Soft-label smoothing: Mixed labels prevent overfitting to hard class boundaries, improve calibration, and act similarly to distillation (Wickstrøm et al., 2022).
- Latent space interpolation: LatentMixUp++ leverages more linear, task-aligned representations, mitigating issues with destructive raw-space interpolation (e.g., cancellation under phase shifts).
- Minimal sensitivity to hyperparameters: a single mixing-coefficient setting is effective across datasets; no extensive parameter search is required (Guo et al., 2023).
MixUp's effectiveness is greatest when interpolated samples remain on-manifold and the datasets benefit from regularized decision boundaries. Detrimental effects can arise if interpolation yields unrealistic signals, such as when discriminative features are highly localized or sharp spiking patterns are averaged out (Yang et al., 2022).
6. Extensions, Variants, and Best Practices
Beyond raw-space MixUp, related mix-based augmentations have been formulated:
- CutMix: Random segments ("patches") are swapped between series, preserving uninterpolated context and further diversifying signal morphologies. In time-series, contiguous intervals are exchanged across all channels (Guo et al., 2023).
- Manifold MixUp: Mixing occurs in hidden layers deeper in the network, introducing additional regularization by perturbing high-level feature representations (Guo et al., 2023).
- Semi-supervised MixUp: Combining labeled data and pseudo-labeled unlabeled samples with mix-based augmentation yields further gains, especially in data-scarce regimes (Aggarwal et al., 2023).
- Multiple MixUps per batch: Empirical analysis recommends $K \le 2$ for jointly optimizing over real and synthetic pairs; higher $K$ can induce over-regularization (Aggarwal et al., 2023).
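The time-series CutMix variant above can be sketched as swapping one contiguous interval across all channels and mixing labels by interval proportion (interval bounds drawn uniformly here, an assumed choice):

```python
import numpy as np

def cutmix_timeseries(x, y, rng=None):
    """Time-series CutMix sketch: exchange a contiguous time interval across
    all channels between paired series. x: (B, T, C), y: (B, n_classes).
    """
    if rng is None:
        rng = np.random.default_rng()
    B, T, _ = x.shape
    perm = rng.permutation(B)                     # random partner per sample
    lo, hi = sorted(rng.integers(0, T + 1, size=2))
    x_new = x.copy()
    x_new[:, lo:hi, :] = x[perm, lo:hi, :]        # patch from partner series
    lam = 1.0 - (hi - lo) / T                     # fraction of original kept
    y_new = lam * y + (1.0 - lam) * y[perm]       # proportional label mix
    return x_new, y_new
```

Unlike MixUp, the patched regions are copied verbatim rather than interpolated, which preserves local signal morphology inside each segment.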
Best practices:
- Standardize sequence lengths across minibatch.
- Apply MixUp per minibatch after shuffling; use a fixed mixing-coefficient distribution (e.g., one $\alpha$).
- Include both real and mixed samples in each training batch.
- For unbalanced classes, combine MixUp with class-balanced sampling.
- Integrate CutMix or Manifold MixUp as needed for increased augmentation diversity.
7. Limitations and Ongoing Challenges
While "embarrassingly simple" MixUp strategies are robust and adaptable, several open issues remain:
- Interpolation realism: For tasks with highly non-overlapping temporal events, MixUp can degrade discriminative features.
- Performance with many synthetic samples: Excessive augmentation relative to real data can harm performance; careful selection of $K$ is advised (Aggarwal et al., 2023).
- Extensibility: Application to other time-series tasks (forecasting, anomaly detection), non-convolutional architectures, or integration with consistency-based semi-supervised objectives is an open direction (Aggarwal et al., 2023).
- Efficiency: Computational cost increases linearly with $K$; sampling strategies or dynamic mix ratios may offer improvements.
In summary, MixUp and its latent and manifold extensions provide universal, high-performing augmentation strategies for time-series learning with minimal domain expertise or tuning required, consistently improving classification and representation learning outcomes across varied benchmark and clinical datasets (Aggarwal et al., 2023, Guo et al., 2023, Wickstrøm et al., 2022, Yang et al., 2022).