On-the-fly Time Series Data Augmentation
- The paper demonstrates that integrating on-the-fly augmentation in mini-batch SGD enhances model generalization by dynamically mixing synthetic and real data samples.
- OnDAT employs a range of time series transformations, such as jittering and scaling, with fine-tuned hyperparameters to optimize data diversity.
- Empirical results reveal statistically significant RMSE improvements across datasets, confirming OnDAT's effectiveness for forecasting and classification tasks.
On-the-fly Data Augmentation for Time Series (OnDAT) refers to the paradigm of generating and integrating synthetic time series data directly within the model training loop, eliminating the need for large, pre-augmented datasets. This approach is distinct from traditional offline augmentation, ensuring a balanced and continually refreshed mixture of real and synthetic examples during each training iteration. OnDAT is motivated by the limited availability of time series data in many forecasting and classification applications, the risk of overfitting with small datasets, and the inefficiencies inherent to static augmentation pipelines. Empirical results demonstrate that OnDAT yields statistically significant improvements in model generalization across a wide range of neural architectures and benchmarks (Cerqueira et al., 2024).
1. Formal Problem Definition
The general time series forecasting problem consists of learning a mapping from observed historical windows to a prediction horizon: given a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i \in \mathbb{R}^{q}$ is a window of $q$ past values of a univariate time series and $y_i \in \mathbb{R}^{h}$ the subsequent $h$ values, the model $f_\theta$ is trained to minimize a loss
$$\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big)$$
over rolling windows of size $q$ predicting $h$ future steps. For data augmentation, a synthetic-data generator $G$ operates on these supervised pairs, outputting $(\tilde{x}_i, \tilde{y}_i) = G(x_i, y_i)$. The augmentation ratio $\alpha \in [0, 1]$ controls the fraction of synthetic samples in each mini-batch. On-the-fly augmentation is integrated into mini-batch stochastic gradient descent (SGD), where synthetic samples are dynamically generated for each batch and not stored persistently (Cerqueira et al., 2024).
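As a concrete illustration of the supervised formulation above, the windowed pairs can be built with a simple sliding window. The helper `make_windows` and the toy series below are illustrative, not from the cited work:

```python
import numpy as np

def make_windows(series, q, h):
    """Slice a univariate series into supervised pairs:
    inputs of q past lags and targets of the next h steps."""
    X, Y = [], []
    for t in range(len(series) - q - h + 1):
        X.append(series[t:t + q])          # historical window x_i
        Y.append(series[t + q:t + q + h])  # forecast horizon y_i
    return np.asarray(X), np.asarray(Y)

series = np.arange(10.0)            # toy series 0..9
X, Y = make_windows(series, q=3, h=2)
print(X.shape, Y.shape)             # (6, 3) (6, 2)
```

Each row of `X` pairs with the row of `Y` immediately following it in time, which is exactly the supervised pair $(x_i, y_i)$ a generator $G$ would operate on.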
2. Algorithms and Frameworks
A canonical OnDAT algorithm operates as follows:
```
initialize θ
for each epoch:
    for each mini-batch B_real of size B:
        N_syn ← ceil(α · B)
        select N_syn indices from B_real
        B_syn ← {G(x_i, y_i) for i in selected indices}
        B ← B_real ∪ B_syn
        L ← (1/|B|) Σ_{(x, y) ∈ B} ℓ(f_θ(x), y)
        θ ← θ − η ∇_θ L
```
This structure ensures that every batch contains a specified proportion of synthetic examples, generated fresh each iteration and consistent with the original data distribution. OnDAT can be further enhanced by sample-adaptive mechanisms, such as learned weighting of losses for each augmented variant (W-Augment), ranking-based selection (α-trimmed augment), or gating networks that learn the sample-wise contribution of each transformation branch (Oba et al., 2021, Fons et al., 2021).
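The canonical loop can be sketched in plain NumPy. The linear forecaster, the jittering generator `G`, and all hyperparameter values below are illustrative assumptions rather than the exact setup of the cited work:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, y, sigma=0.05):
    """A simple generator G: additive Gaussian noise on the input window."""
    return x + rng.normal(0.0, sigma, x.shape), y

# Toy data: N windows of q lags, next-step targets.
N, q = 64, 8
X = rng.normal(size=(N, q))
y = X @ np.ones(q) + rng.normal(0.0, 0.01, N)

theta = np.zeros(q)                 # linear forecaster f_theta(x) = x . theta
alpha, B, eta = 0.5, 16, 0.01       # mixing ratio, batch size, learning rate

for epoch in range(50):
    for start in range(0, N, B):
        xb, yb = X[start:start + B], y[start:start + B]
        n_syn = int(np.ceil(alpha * len(xb)))        # synthetic samples per batch
        idx = rng.choice(len(xb), n_syn, replace=False)
        xs, ys = jitter(xb[idx], yb[idx])            # generated fresh, never stored
        xb, yb = np.vstack([xb, xs]), np.concatenate([yb, ys])
        grad = 2 * xb.T @ (xb @ theta - yb) / len(xb)
        theta -= eta * grad

print(np.round(theta, 2))           # converges close to the true weights (all ones)
```

Note that the synthetic samples exist only for the duration of one gradient step, which is the defining property of the on-the-fly paradigm.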
3. Synthetic Generation Techniques
OnDAT leverages a diverse palette of time series transformations. Core operators include:
| Operator | Transformation Description | Key Hyperparameters |
|---|---|---|
| Jittering | Additive Gaussian noise: $\tilde{x} = x + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$ | Noise std $\sigma$ |
| Scaling | Multiplicative scaling: $\tilde{x} = s \cdot x$, $s \sim \mathcal{N}(1, \sigma^2)$ | Scale std $\sigma$ |
| Permutation | Shuffle $n$ equal segments of the series | Number of segments $n$ |
| Magnitude Warping | Multiply by smooth random curve $c(t)$: $\tilde{x}(t) = c(t) \cdot x(t)$ | Control points, curve std $\sigma$ |
| Time Warping | Time-axis distortion via monotonic spline $\tau(t)$: $\tilde{x}(t) = x(\tau(t))$ | Warp strength |
| Window Warping | Stretch/compress a random window of the series | Scale factors, window length |
| Random Cropping | Crop a subseries and rescale to the original length | Crop length |
Variants include more specialized operators (e.g., window slicing, magnitude warping with exponential curves) and the composition of multiple operators per batch (Cerqueira et al., 2024, Malialis et al., 2022, Oba et al., 2021).
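A minimal sketch of several operators from the table, in plain NumPy; function names and default strengths are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def jittering(x, sigma=0.03):
    """x_tilde = x + eps, eps ~ N(0, sigma^2)."""
    return x + rng.normal(0.0, sigma, x.shape)

def scaling(x, sigma=0.1):
    """x_tilde = s * x with a single random factor s ~ N(1, sigma^2)."""
    return x * rng.normal(1.0, sigma)

def permutation(x, n_segments=4):
    """Split the series into equal segments and shuffle their order."""
    segments = np.array_split(x, n_segments)
    order = rng.permutation(n_segments)
    return np.concatenate([segments[i] for i in order])

def magnitude_warp(x, n_knots=4, sigma=0.2):
    """Multiply by a smooth random curve interpolated from a few knots."""
    knots = rng.normal(1.0, sigma, n_knots)
    curve = np.interp(np.linspace(0, 1, len(x)), np.linspace(0, 1, n_knots), knots)
    return x * curve

x = np.sin(np.linspace(0, 4 * np.pi, 100))
for op in (jittering, scaling, permutation, magnitude_warp):
    print(op.__name__, op(x).shape)   # each operator preserves series length
```

All four operators are length-preserving, so their outputs can be mixed freely into a mini-batch of fixed-size windows.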
4. Applications: Forecasting and Online Learning
OnDAT frameworks are agnostic to backbone model architecture. Demonstrated applications include:
- Forecasting: LSTM-based global models, TCNs, and Transformer encoder-decoder models, all using autoregressive windows and output heads suitable for next-step or multi-step forecasting tasks (Cerqueira et al., 2024, Zhang et al., 2024).
- Classification: Application to streaming settings using online active learning and a multi-queue memory, where each class maintains a small FIFO buffer; augmented examples are generated on label queries and merged with the real memory for each update, providing robust generalization under memory and budget constraints (Malialis et al., 2022).
- Concept Drift Adaptation: D³A employs on-the-fly Gaussian noise injection into stored windows to broaden support and mitigate train/test distribution gaps during concept drift in streaming forecasting (Zhang et al., 2024).
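The augmented-queue idea in the streaming classification setting can be sketched as follows. The class name `MultiQueueMemory`, the jittering-based augmentation, and all capacities are illustrative assumptions, not the exact mechanism of Malialis et al. (2022):

```python
import numpy as np
from collections import deque

rng = np.random.default_rng(0)

class MultiQueueMemory:
    """Per-class FIFO buffers; labeled examples are stored and later merged
    with freshly jittered copies to form each model update batch."""
    def __init__(self, n_classes, capacity=10, n_aug=2, sigma=0.05):
        self.queues = {c: deque(maxlen=capacity) for c in range(n_classes)}
        self.n_aug, self.sigma = n_aug, sigma

    def add(self, x, label):
        self.queues[label].append(np.asarray(x))

    def training_batch(self):
        xs, ys = [], []
        for label, q in self.queues.items():
            for x in q:
                xs.append(x); ys.append(label)              # real example
                for _ in range(self.n_aug):                 # fresh synthetic copies
                    xs.append(x + rng.normal(0.0, self.sigma, x.shape))
                    ys.append(label)
        return np.asarray(xs), np.asarray(ys)

mem = MultiQueueMemory(n_classes=2, capacity=3)
for i in range(7):                      # oldest class-0 item is evicted (FIFO)
    mem.add(np.full(4, float(i)), label=i % 2)
X, y = mem.training_batch()
print(X.shape, np.bincount(y))          # (18, 4) [9 9]
```

Bounded `deque(maxlen=...)` buffers keep memory constant under an unbounded stream, while augmentation happens only at update time, matching the on-the-fly principle.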
5. Empirical Results and Benchmarking
Comprehensive evaluations have established the empirical superiority of OnDAT relative to both no-augmentation and offline augmentation:
| Dataset | No Augmentation | Offline DA | OnDAT |
|---|---|---|---|
| Electricity | 0.102 | 0.098 | 0.094 |
| Traffic | 0.156 | 0.151 | 0.147 |
| Solar-Energy | 0.089 | 0.087 | 0.083 |
| Exchange-Rate | 0.021 | 0.020 | 0.019 |
| M4 Monthly | 0.217 | 0.213 | 0.208 |
| Tourism Mthly | 0.305 | 0.298 | 0.289 |
Statistical analysis (paired Wilcoxon signed-rank tests) confirms significant performance gains, with OnDAT achieving a 4.3% average RMSE improvement over the non-augmented baseline (Cerqueira et al., 2024). Variant algorithms (e.g., gating networks, W-Augment, α-trim) further enhance sample efficiency, demonstrate robust performance under nonstationarity, and adaptively exploit transform-specific utility across datasets (Oba et al., 2021, Fons et al., 2021).
6. Design Principles and Best Practices
- Batch Mixing Ratio: For best generalization, keep the mixing ratio $\alpha$ moderate; overly large values of $\alpha$ can induce underfitting to real patterns (Cerqueira et al., 2024).
- Operator Selection: Employ multiple, efficient, and differentiable transforms. STL decomposition combined with the moving blocks bootstrap (STL+MBB) is preferred for low-frequency data, while lighter transforms (jittering, scaling) are preferable for long sequences.
- Validation and Early Stopping: Apply augmentation at validation time to obtain unbiased performance estimates and prevent early stopping bias.
- Compositionality: Mixing operators within a batch increases coverage but requires tuning of each operator’s hyperparameters on a held-out validation set.
- Resource Tradeoffs: OnDAT avoids the storage and I/O overhead of pre-augmented datasets; additional computational overhead is linear in the number of augmentations per batch.
- Active Data Selection: In streaming classification, augment after each label query, keeping buffer size and budget small while maintaining class balance (Malialis et al., 2022).
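The compositionality practice above can be sketched as a per-sample pipeline that draws operators from a palette, each paired with its own tuned strength; the palette contents and values here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical operator palette: each entry pairs a transform
# with a strength tuned on a held-out validation set.
palette = [
    (lambda x, s: x + rng.normal(0.0, s, x.shape), 0.03),   # jittering, noise std
    (lambda x, s: x * rng.normal(1.0, s), 0.10),            # scaling, scale std
]

def compose(x, k=2):
    """Apply k operators drawn at random, each with its own hyperparameter."""
    for i in rng.choice(len(palette), size=min(k, len(palette)), replace=False):
        op, strength = palette[i]
        x = op(x, strength)
    return x

y = compose(np.ones(50))
print(y.shape)   # length preserved: (50,)
```

Keeping the palette small keeps the per-batch overhead linear and the hyperparameter grid low-dimensional, consistent with the resource-tradeoff note above.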
7. Open Challenges and Limitations
- Overhead: Per-batch computational cost scales linearly with the number of transforms applied, so the operator set per batch is typically kept small in practice.
- Transform Utility: Not all operators are uniformly beneficial; automated weighting (W-Augment), sample-wise gating, or ranking-based methods partially mitigate the risk of unhelpful augmentations (Fons et al., 2021, Oba et al., 2021).
- Parameter Tuning: Requires careful hyperparameter search over augmentation strengths and mixing ratios, though the grid is typically low-dimensional.
- Drift Adaptivity: While On-the-fly augmentation is effective under drift, its interaction with longer-term nonstationarities or adversarial distribution shifts merits further study. D³A provides theoretical guarantees for Gaussian augmentation closing the generalization gap under concept shift (Zhang et al., 2024).
8. Representative Implementations
Four major OnDAT instantiations:
| Reference | Core Augmentation Mechanism | Domain |
|---|---|---|
| (Cerqueira et al., 2024) | Dynamic ratio per mini-batch, seven-operator palette | Forecasting |
| (Oba et al., 2021) | Sample-adaptive gating over augmentation branches | Recognition/classification |
| (Zhang et al., 2024) | On-the-fly Gaussian noise injection, concept drift adaptation | Online forecasting |
| (Malialis et al., 2022) | Active learning + augmented queues + online SGD | Data stream classification |
All maintain in-loop data augmentation, memory efficiency, and adaptability to streaming or nonstationary regimes.
On-the-fly data augmentation for time series provides a unified, memory-efficient, and empirically validated framework for improving generalization in deep learning-based forecasting and classification. The integration of dynamic synthetic sample generation with online or batch optimization delivers consistent improvements across architectures and domains, with extensions for sample-adaptive weighting and concept drift mitigation further enhancing robustness and sample efficiency (Cerqueira et al., 2024, Oba et al., 2021, Fons et al., 2021, Malialis et al., 2022, Zhang et al., 2024).