DropoutTS: Adaptive Dropout for Time Series
- DropoutTS comprises adaptive dropout techniques that adjust regularization based on per-sample noise and training phase for improved model robustness.
- It employs sample-adaptive dropout via spectral noise scoring and time-dependent scheduling to mitigate early over-regularization and enhance convergence.
- The framework also integrates ensemble-based sample selection and Tabu Dropout to diversify network activations while minimizing computational overhead.
DropoutTS encompasses a spectrum of sample-adaptive and time-adaptive dropout approaches that modulate neural network regularization rates based on either instance-wise noise or the optimization curriculum. Originally developed in the context of time series prediction, robust learning under noisy supervision, and more general neural network regularization, DropoutTS methods improve generalization and robustness by shifting from a global, fixed dropout rate to mechanisms where the regularization strength is tuned to the properties of each sample or to the training phase. The principal instantiations are adaptive dropout for robust time series forecasting, curriculum (time-scheduled) dropout, and dropout-based ensemble selection strategies.
1. Sample-Adaptive Dropout for Time Series Forecasting
DropoutTS, as introduced for time series applications, is a model-agnostic, “capacity-centric” module that adapts per-sample dropout rates according to the estimated noise level of each input instance (Zhong et al., 29 Jan 2026). The method computes a real-time, differentiable noise score for each sequence using a spectral reconstruction pipeline, then maps this score to a dropout probability, and finally applies sample-specific dropout during network training.
Spectral Noise Scoring
- Detrend the input using an OLS-fitted linear trend and subtract it.
- Compute FFT: For the detrended data, obtain the channel-wise amplitude and log-amplitude spectrum.
- Spectral Flatness Measure (SFM): Quantify the spectrum's “whiteness” as the ratio of the geometric mean to the arithmetic mean of the spectral magnitudes.
- Adaptive Thresholding: Use an MLP to generate a per-instance threshold in spectral space.
- Spectral Masking: Via a soft threshold, attenuate frequency components deemed noisy, reconstructing the “clean” signal by IFFT.
- Noise Score: The mean absolute residual between the original and reconstructed input quantifies input corruption without external annotation.
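The scoring pipeline above can be sketched in NumPy as follows. This is an illustrative approximation, not the reference implementation: the fixed-quantile soft threshold stands in for the paper's learned MLP threshold, and the function name `spectral_noise_score` is hypothetical.

```python
import numpy as np

def spectral_noise_score(x, quantile=0.6):
    """Sketch of the DropoutTS spectral noise-scoring pipeline.

    x: 1-D array, one channel of one input sequence.
    quantile: stand-in for the learned per-instance threshold
              (the paper uses an MLP; a fixed quantile is assumed here).
    Returns (noise_score, spectral_flatness).
    """
    t = np.arange(len(x))
    # 1. OLS linear detrend: fit x ~ a*t + b and subtract the fit.
    a, b = np.polyfit(t, x, deg=1)
    resid = x - (a * t + b)

    # 2. FFT amplitude spectrum of the detrended signal.
    spec = np.fft.rfft(resid)
    amp = np.abs(spec) + 1e-12

    # 3. Spectral Flatness Measure: geometric mean / arithmetic mean.
    sfm = np.exp(np.mean(np.log(amp))) / np.mean(amp)

    # 4. Soft thresholding: differentiable gate attenuating low-amplitude bins.
    thr = np.quantile(amp, quantile)
    mask = 1.0 / (1.0 + np.exp(-(amp - thr)))

    # 5. Reconstruct the "clean" signal by inverse FFT of the masked spectrum.
    clean = np.fft.irfft(spec * mask, n=len(resid))

    # 6. Noise score: mean absolute residual between input and reconstruction.
    score = np.mean(np.abs(resid - clean))
    return score, sfm
```

A noisier input should yield both a higher noise score and a flatter (whiter) spectrum than a clean periodic signal.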
Mapping Noise to Dropout Rate
Let $s_i$ denote the per-sample noise level, min-max normalized within the batch to $\tilde{s}_i \in [0,1]$. Through a learned mapping with bounds $(p_{\min}, p_{\max})$, the dropout probability is $p_i = p_{\min} + (p_{\max} - p_{\min})\,\tilde{s}_i$.
Clean samples ($\tilde{s}_i \to 0$) receive $p_i \to p_{\min}$; noisy samples ($\tilde{s}_i \to 1$) receive $p_i \to p_{\max}$. Dropout is then applied using a masked Bernoulli mechanism with straight-through estimation to enable backpropagation through $p_i$.
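A minimal sketch of the noise-to-rate mapping and per-sample masking, assuming a linear interpolation between two bounds and forward-only masking (the straight-through gradient path is omitted); the function name and default bounds are illustrative:

```python
import numpy as np

def adaptive_dropout(h, scores, p_min=0.1, p_max=0.5, rng=None):
    """Map per-sample noise scores to dropout rates and apply masking.

    h:      (batch, features) activations.
    scores: (batch,) raw noise scores from the spectral pipeline.
    The linear interpolation between p_min and p_max is an assumed
    instantiation of the learned mapping.
    """
    rng = rng or np.random.default_rng()
    # Min-max normalize scores within the batch to [0, 1].
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    # Per-sample dropout probability: clean -> p_min, noisy -> p_max.
    p = p_min + (p_max - p_min) * s
    # Per-sample Bernoulli retain mask with inverted-dropout scaling.
    keep = 1.0 - p[:, None]
    mask = rng.random(h.shape) < keep
    return h * mask / keep, p
```

Each row of the batch thus receives its own dropout probability, with the noisiest sample in the batch regularized at `p_max` and the cleanest at `p_min`.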
Forward Pass and Implementation
Each network dropout layer replaces standard dropout with sample-adaptive masking: the retain mask for sample $i$ is drawn Bernoulli$(1 - p_i)$ and applied with inverted-dropout rescaling.
All spectral modules are removed at inference, with the dropout rate held fixed.
2. Time-Dependent Dropout Scheduling (Curriculum Dropout)
Curriculum Dropout, alternately referenced as DropoutTS in the scheduling context, implements a time-dependent schedule for the retention probability $\theta(t)$, replacing the static rate in conventional dropout protocols (Morerio et al., 2017).
Retention Schedule
The exponential curriculum is defined by $\theta(t) = (1 - \bar{\theta})\,e^{-\gamma t} + \bar{\theta}$ with decay rate $\gamma \propto 1/T$, where $T$ is the total number of training steps and $\bar{\theta}$ is the final retention level ($0.5$ for dense, $0.75$ for convolutional layers).
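The schedule can be sketched as below; the decay-speed constant `gamma_scale` is an assumed knob, not a value from the paper:

```python
import math

def retention_schedule(t, total_steps, theta_bar=0.5, gamma_scale=10.0):
    """Exponential Curriculum Dropout retention schedule (sketch).

    Starts at theta(0) = 1 (no dropout) and anneals toward the final
    retention level theta_bar; gamma_scale controls how quickly the
    schedule approaches theta_bar and is an assumed default.
    """
    gamma = gamma_scale / total_steps
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar
```

Because retention starts at 1, training begins with no dropout at all and tightens monotonically toward the target rate.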
Curriculum-Learning Perspective
Setting $\theta(0) = 1$ suppresses regularization at initialization. As the dropout rate $1 - \theta(t)$ increases, higher proportions of units are stochastically zeroed, making the task incrementally more difficult in the curriculum-learning sense. The entropy of the input distribution increases over training, ensuring graduated exposure to more challenging learning conditions.
Regularization Analysis
For static dropout, the regularization strength is fixed by the constant retention probability $\theta$. A time-varying $\theta(t)$ instead introduces an annealed regularizer, with zero penalty early in training and an asymptotic approach to the target regularization strength. This avoids excessive constraint during initial convergence and increases protection against overfitting during later optimization.
3. DropoutTS in Ensemble and Sample Selection Methods
DropoutTS provides an architectural replacement for dual-network training paradigms such as Co-teaching+ and JoCor, particularly under label noise (Lakshya, 2022). Rather than operating two independent models for clean-sample selection, DropoutTS utilizes two independent dropout mask realizations over a single parameter set, thereby simulating an exponential ensemble and enabling efficient sample selection.
Algorithmic Structure
For each mini-batch:
- Two forward passes are performed using independent Bernoulli masks, yielding two sets of logits/losses.
- The loss under the alternative (“peer”) mask is used for “small-loss” clean-sample selection.
- A single backward pass updates shared parameters only on the subset of samples selected as having the lowest loss under the alternative mask.
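The selection step can be illustrated on a toy linear model; the model, squared loss, and `keep_frac` are assumptions, but the two-mask, peer-loss structure follows the steps above:

```python
import numpy as np

def dual_mask_select(X, y, W, p=0.5, keep_frac=0.7, rng=None):
    """One DropoutTS-style selection step on a single linear model (sketch).

    Two forward passes use independent dropout masks over the same
    weights W. Each mask's update set is the keep_frac fraction of the
    batch with the smallest per-sample loss under the *other* (peer)
    mask. Returns the two selected index sets.
    """
    rng = rng or np.random.default_rng()
    n_keep = int(keep_frac * len(X))

    def masked_losses():
        # Independent inverted-dropout mask on the inputs.
        mask = (rng.random(X.shape) < 1.0 - p) / (1.0 - p)
        pred = (X * mask) @ W
        return (pred - y) ** 2

    loss_a, loss_b = masked_losses(), masked_losses()
    # Small-loss selection under the peer mask.
    sel_a = np.argsort(loss_b)[:n_keep]   # samples mask A trains on
    sel_b = np.argsort(loss_a)[:n_keep]   # samples mask B trains on
    return sel_a, sel_b
```

Samples with corrupted labels incur large losses under both masks and therefore rarely survive the peer's small-loss filter, mimicking the dual-network selection with a single parameter set.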
This mechanism roughly halves memory requirements relative to dual-network methods and decreases computational cost per batch by omitting one backward pass.
4. Tabu Dropout: Diversification by Tabu Strategy
Tabu Dropout (sometimes labeled “DropoutTS” in the literature) enforces diversity in regularization by disallowing consecutive suppression of the same unit (Ma et al., 2018). It maintains a short-term tabu list (a record of dropped/retained neurons in each layer). If neuron $j$ was dropped in the previous forward pass ($m_j^{(t-1)} = 0$), it must be retained in the current pass; otherwise, standard Bernoulli sampling with rate $p$ applies. The effective dropout rate therefore evolves as $p_{t+1} = p\,(1 - p_t)$ with initial $p_0 = p$, converging to the fixed point $p^{\star} = p/(1+p)$.
This method enhances network exploration across the implicit dropout-induced ensemble and further mitigates co-adaptation, as all units participate with guaranteed frequency. The overhead is negligible—a single extra per-unit Boolean memory and comparison per forward pass.
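A sketch of mask generation under the tabu rule (the function name and array layout are illustrative):

```python
import numpy as np

def tabu_dropout_masks(n_units, n_steps, p=0.5, rng=None):
    """Generate successive dropout masks under the tabu rule (sketch).

    A unit dropped at step t-1 (mask 0) is forced to be retained at
    step t; all other units are dropped i.i.d. with probability p.
    Returns an (n_steps, n_units) array of 0/1 retain masks.
    """
    rng = rng or np.random.default_rng()
    masks = np.ones((n_steps, n_units), dtype=int)
    prev_dropped = np.zeros(n_units, dtype=bool)
    for t in range(n_steps):
        drop = rng.random(n_units) < p
        drop &= ~prev_dropped  # tabu: recently dropped units must stay
        masks[t] = (~drop).astype(int)
        prev_dropped = drop
    return masks
```

For $p = 0.5$ the effective per-step drop rate settles near $p/(1+p) = 1/3$, and by construction no unit is ever zeroed in two consecutive passes, which is exactly the guaranteed-participation property described above.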
5. Empirical Results and Robustness Analysis
Robustness in Time Series
Extensive evaluation demonstrates that DropoutTS achieves robust performance gains across both synthetic and real-world time series datasets (Zhong et al., 29 Jan 2026). On a synthetic benchmark with varying noise and non-stationarities (“Synth-12”), DropoutTS reduces average MSE for Informer, Crossformer, PatchTST, and TimesNet alike. On public multivariate datasets (ETTh1/2, ETTm1/2, Electricity, Weather, ILI), relative MSE reductions are likewise reported for Informer, Crossformer, and TimesNet (the latter on ILI).
Ablation confirms the necessity of all pipeline components: omitting OLS detrending, spectral normalization, or the SFM each degrades MSE. Cumulatively, adaptive dropout improves over any fixed dropout rate $p$.
Generalization in Curriculum Dropout
On image classification, Curriculum Dropout matches or exceeds standard dropout’s generalization across a range of datasets and architectures, including MNIST (MLP), Double-MNIST, CIFAR-10, and Caltech-101 (see Table 1 of (Morerio et al., 2017) for the per-benchmark accuracy gains of standard versus curriculum dropout).
Curriculum Dropout avoids over-regularization at early epochs, resolves the instability seen in abrupt “switch” schedules, and yields lower variance in test accuracy.
Sample Selection Efficiency
Within noisy-label regimes, DropoutTS (exponential ensemble approximation) improves test accuracy over standard Co-teaching+ on MNIST under pairflip noise, and over JoCor on CIFAR-100 under symmetric noise (Lakshya, 2022). Memory and computational overhead are substantially reduced.
Tabu Dropout demonstrates consistent but moderate accuracy improvements over vanilla dropout on MNIST and Fashion-MNIST, with faster convergence.
6. Computational Overhead and Implementation Considerations
DropoutTS variants generally require little to no additional parameterization or major architectural change:
- Sample-adaptive dropout for time series: +4 scalars per dropout layer. Training-time computation is increased (due to FFT and MLP-based noise scoring), but convergence is faster, reducing overall training cost. At inference, all adaptive calculations are dropped and standard fixed dropout rates are applied.
- Curriculum scheduling: No additional cost beyond exponential scheduling of .
- Ensemble selection: Reduces memory (half parameters vs. dual-network methods) and increases per-step speed (one less backward pass).
- Tabu Dropout: No extra parameters; effectively zero overhead per pass.
A plausible implication is that DropoutTS is scalable to large networks and datasets, and can be universally applied to any dropout-capable architecture.
7. Theoretical Foundations and Limitations
The generalization benefits of DropoutTS approaches are underpinned by data-dependent regularization theory: for sample-adaptive dropout, optimal regularization strength increases with per-sample noise (heteroscedastic theory), and for curriculum dropout, annealed regularization aligns with a well-structured curriculum (Morerio et al., 2017, Zhong et al., 29 Jan 2026). Rademacher complexity analysis shows that stronger regularization can tighten generalization bounds on noisy instances.
However, limitations of DropoutTS include additional per-sample computation (notably FFT/IFFT for time series), potential sensitivity to the dropout-rate bounds ($p_{\min}$, $p_{\max}$), and, in the case of Tabu Dropout, short-memory constraints on unit masking.
References:
- Curriculum Dropout (Morerio et al., 2017)
- DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting (Zhong et al., 29 Jan 2026)
- Dropout can Simulate Exponential Number of Models for Sample Selection Techniques (Lakshya, 2022)
- Dropout with Tabu Strategy for Regularizing Deep Neural Networks (Ma et al., 2018)