DropoutTS: Adaptive Dropout for Time Series
- DropoutTS comprises adaptive dropout techniques that adjust regularization based on per-sample noise and training phase for improved model robustness.
- It employs sample-adaptive dropout via spectral noise scoring and time-dependent scheduling to mitigate early over-regularization and enhance convergence.
- The framework also integrates ensemble-based sample selection and Tabu Dropout to diversify network activations while minimizing computational overhead.
DropoutTS encompasses a spectrum of sample-adaptive and time-adaptive dropout approaches that modulate neural network regularization rates based on either instance-wise noise or the optimization curriculum. Originally developed in the context of time series prediction, robust learning under noisy supervision, and more general neural network regularization, DropoutTS methods improve generalization and robustness by shifting from a global, fixed dropout rate to mechanisms where the regularization strength is tuned to the properties of each sample or to the training phase. The principal instantiations are adaptive dropout for robust time series forecasting, curriculum (time-scheduled) dropout, and dropout-based ensemble selection strategies.
1. Sample-Adaptive Dropout for Time Series Forecasting
DropoutTS, as introduced for time series applications, is a model-agnostic, “capacity-centric” module that adapts per-sample dropout rates according to the estimated noise level of each input instance (Zhong et al., 29 Jan 2026). The method computes a real-time, differentiable noise score for each sequence using a spectral reconstruction pipeline, then maps this score to a dropout probability, and finally applies sample-specific dropout during network training.
Spectral Noise Scoring
- Detrend the input using an OLS-fitted linear trend and subtract it.
- Compute FFT: For the detrended data, obtain the channel-wise amplitude and log-amplitude spectrum.
- Spectral Flatness Measure (SFM): Quantify the spectrum's “whiteness” as the ratio of the geometric mean to the arithmetic mean of the spectral magnitudes.
- Adaptive Thresholding: Use an MLP to generate a per-instance threshold in spectral space.
- Spectral Masking: Via a soft threshold, attenuate frequency components deemed noisy, reconstructing the “clean” signal by IFFT.
- Noise Score: The mean absolute residual between the original and reconstructed input quantifies input corruption without external annotation.
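The scoring pipeline above can be sketched in NumPy as follows. This is an illustrative approximation, not the reference implementation: the fixed-quantile soft threshold stands in for the paper's learned MLP threshold, and the function name `spectral_noise_score` is hypothetical.

```python
import numpy as np

def spectral_noise_score(x, quantile=0.6):
    """Sketch of the DropoutTS spectral noise-scoring pipeline.

    x: 1-D array, one channel of one input sequence.
    quantile: stand-in for the learned per-instance threshold
              (the paper uses an MLP; a fixed quantile is assumed here).
    Returns (noise_score, spectral_flatness).
    """
    t = np.arange(len(x))
    # 1. OLS linear detrend: fit x ~ a*t + b and subtract the fit.
    a, b = np.polyfit(t, x, deg=1)
    resid = x - (a * t + b)

    # 2. FFT amplitude spectrum of the detrended signal.
    spec = np.fft.rfft(resid)
    amp = np.abs(spec) + 1e-12

    # 3. Spectral Flatness Measure: geometric mean / arithmetic mean.
    sfm = np.exp(np.mean(np.log(amp))) / np.mean(amp)

    # 4. Soft thresholding: differentiable gate attenuating low-amplitude bins.
    thr = np.quantile(amp, quantile)
    mask = 1.0 / (1.0 + np.exp(-(amp - thr)))

    # 5. Reconstruct the "clean" signal by inverse FFT of the masked spectrum.
    clean = np.fft.irfft(spec * mask, n=len(resid))

    # 6. Noise score: mean absolute residual between input and reconstruction.
    score = np.mean(np.abs(resid - clean))
    return score, sfm
```

A noisier input should yield both a higher noise score and a flatter (whiter) spectrum than a clean periodic signal.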
Mapping Noise to Dropout Rate
Let $s_i$ denote the per-sample noise level, min-max normalized within the batch to $\tilde{s}_i \in [0,1]$. Through a learned mapping with bounds $(p_{\min}, p_{\max})$, the dropout probability is $p_i = p_{\min} + (p_{\max} - p_{\min})\,\tilde{s}_i$.
Clean samples ($\tilde{s}_i \to 0$) receive $p_i \to p_{\min}$; noisy samples ($\tilde{s}_i \to 1$) receive $p_i \to p_{\max}$. Dropout is then applied using a masked Bernoulli mechanism with straight-through estimation to enable backpropagation through $p_i$.
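A minimal sketch of the noise-to-rate mapping and per-sample masking, assuming a linear interpolation between two bounds and forward-only masking (the straight-through gradient path is omitted); the function name and default bounds are illustrative:

```python
import numpy as np

def adaptive_dropout(h, scores, p_min=0.1, p_max=0.5, rng=None):
    """Map per-sample noise scores to dropout rates and apply masking.

    h:      (batch, features) activations.
    scores: (batch,) raw noise scores from the spectral pipeline.
    The linear interpolation between p_min and p_max is an assumed
    instantiation of the learned mapping.
    """
    rng = rng or np.random.default_rng()
    # Min-max normalize scores within the batch to [0, 1].
    s = (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)
    # Per-sample dropout probability: clean -> p_min, noisy -> p_max.
    p = p_min + (p_max - p_min) * s
    # Per-sample Bernoulli retain mask with inverted-dropout scaling.
    keep = 1.0 - p[:, None]
    mask = rng.random(h.shape) < keep
    return h * mask / keep, p
```

Each row of the batch thus receives its own dropout probability, with the noisiest sample in the batch regularized at `p_max` and the cleanest at `p_min`.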
Forward Pass and Implementation
Each network dropout layer replaces standard dropout with sample-adaptive masking: the retain mask for sample $i$ is drawn Bernoulli$(1 - p_i)$ and applied with inverted-dropout rescaling.
All spectral modules are removed at inference, with the dropout rate held fixed.
2. Time-Dependent Dropout Scheduling (Curriculum Dropout)
Curriculum Dropout, alternately referenced as DropoutTS in the scheduling context, implements a time-dependent schedule for the retention probability $\theta(t)$, replacing the static rate in conventional dropout protocols (Morerio et al., 2017).
Retention Schedule
The exponential curriculum is defined by $\theta(t) = (1 - \bar{\theta})\,e^{-\gamma t} + \bar{\theta}$ with decay rate $\gamma \propto 1/T$, where $T$ is the total number of training steps and $\bar{\theta}$ is the final retention level ($0.5$ for dense, $0.75$ for convolutional layers).
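The schedule can be sketched as below; the decay-speed constant `gamma_scale` is an assumed knob, not a value from the paper:

```python
import math

def retention_schedule(t, total_steps, theta_bar=0.5, gamma_scale=10.0):
    """Exponential Curriculum Dropout retention schedule (sketch).

    Starts at theta(0) = 1 (no dropout) and anneals toward the final
    retention level theta_bar; gamma_scale controls how quickly the
    schedule approaches theta_bar and is an assumed default.
    """
    gamma = gamma_scale / total_steps
    return (1.0 - theta_bar) * math.exp(-gamma * t) + theta_bar
```

Because retention starts at 1, training begins with no dropout at all and tightens monotonically toward the target rate.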
Curriculum-Learning Perspective
Setting $\theta(0) = 1$ suppresses regularization at initialization. As the dropout rate $1 - \theta(t)$ increases, higher proportions of units are stochastically zeroed, making the task incrementally more difficult in the curriculum-learning sense. The entropy of the input distribution increases over training, ensuring graduated exposure to more challenging learning conditions.
Regularization Analysis
For static dropout, the regularization strength is fixed by the constant retention probability $\theta$. A time-varying $\theta(t)$ instead introduces an annealed regularizer, with zero penalty early in training and an asymptotic approach to the target regularization strength. This avoids excessive constraint during initial convergence and increases protection against overfitting during later optimization.
3. DropoutTS in Ensemble and Sample Selection Methods
DropoutTS provides an architectural replacement for dual-network training paradigms such as Co-teaching+ and JoCor, particularly under label noise (Lakshya, 2022). Rather than operating two independent models for clean-sample selection, DropoutTS utilizes two independent dropout mask realizations over a single parameter set, thereby simulating an exponential ensemble and enabling efficient sample selection.
Algorithmic Structure
For each mini-batch:
- Two forward passes are performed using independent Bernoulli masks, yielding two sets of logits/losses.
- The loss under the alternative (“peer”) mask is used for “small-loss” clean-sample selection.
- A single backward pass updates shared parameters only on the subset of samples selected as having the lowest loss under the alternative mask.
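The selection step can be illustrated on a toy linear model; the model, squared loss, and `keep_frac` are assumptions, but the two-mask, peer-loss structure follows the steps above:

```python
import numpy as np

def dual_mask_select(X, y, W, p=0.5, keep_frac=0.7, rng=None):
    """One DropoutTS-style selection step on a single linear model (sketch).

    Two forward passes use independent dropout masks over the same
    weights W. Each mask's update set is the keep_frac fraction of the
    batch with the smallest per-sample loss under the *other* (peer)
    mask. Returns the two selected index sets.
    """
    rng = rng or np.random.default_rng()
    n_keep = int(keep_frac * len(X))

    def masked_losses():
        # Independent inverted-dropout mask on the inputs.
        mask = (rng.random(X.shape) < 1.0 - p) / (1.0 - p)
        pred = (X * mask) @ W
        return (pred - y) ** 2

    loss_a, loss_b = masked_losses(), masked_losses()
    # Small-loss selection under the peer mask.
    sel_a = np.argsort(loss_b)[:n_keep]   # samples mask A trains on
    sel_b = np.argsort(loss_a)[:n_keep]   # samples mask B trains on
    return sel_a, sel_b
```

Samples with corrupted labels incur large losses under both masks and therefore rarely survive the peer's small-loss filter, mimicking the dual-network selection with a single parameter set.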
This mechanism roughly halves memory requirements relative to dual-network methods and decreases computational cost per batch by omitting one backward pass.
4. Tabu Dropout: Diversification by Tabu Strategy
Tabu Dropout (sometimes labeled “DropoutTS” in the literature) enforces diversity in regularization by disallowing consecutive suppression of the same unit (Ma et al., 2018). It maintains a short-term tabu list (a record of dropped/retained neurons in each layer). If neuron $j$ was dropped in the previous forward pass ($m_j^{(t-1)} = 0$), it must be retained in the current pass; otherwise, standard Bernoulli sampling with rate $p$ applies. The effective dropout rate therefore evolves as $p_{t+1} = p\,(1 - p_t)$ with initial $p_0 = p$, converging to the fixed point $p^{\star} = p/(1+p)$.
This method enhances network exploration across the implicit dropout-induced ensemble and further mitigates co-adaptation, as all units participate with guaranteed frequency. The overhead is negligible—a single extra per-unit Boolean memory and comparison per forward pass.
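A sketch of mask generation under the tabu rule (the function name and array layout are illustrative):

```python
import numpy as np

def tabu_dropout_masks(n_units, n_steps, p=0.5, rng=None):
    """Generate successive dropout masks under the tabu rule (sketch).

    A unit dropped at step t-1 (mask 0) is forced to be retained at
    step t; all other units are dropped i.i.d. with probability p.
    Returns an (n_steps, n_units) array of 0/1 retain masks.
    """
    rng = rng or np.random.default_rng()
    masks = np.ones((n_steps, n_units), dtype=int)
    prev_dropped = np.zeros(n_units, dtype=bool)
    for t in range(n_steps):
        drop = rng.random(n_units) < p
        drop &= ~prev_dropped  # tabu: recently dropped units must stay
        masks[t] = (~drop).astype(int)
        prev_dropped = drop
    return masks
```

For $p = 0.5$ the effective per-step drop rate settles near $p/(1+p) = 1/3$, and by construction no unit is ever zeroed in two consecutive passes, which is exactly the guaranteed-participation property described above.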
5. Empirical Results and Robustness Analysis
Robustness in Time Series
Extensive evaluation demonstrates that DropoutTS achieves robust performance gains across both synthetic and real-world time series datasets (Zhong et al., 29 Jan 2026). On a synthetic benchmark with varying noise and non-stationarities (“Synth-12”), DropoutTS reduces average MSE for Informer, Crossformer, PatchTST, and TimesNet alike. On public multivariate datasets (ETTh1/2, ETTm1/2, Electricity, Weather, ILI), relative MSE reductions are likewise reported for Informer, Crossformer, and TimesNet (the latter on ILI).
Ablation confirms the necessity of all pipeline components: omitting OLS detrending, spectral normalization, or the SFM each degrades MSE. Cumulatively, adaptive dropout improves over any fixed dropout rate $p$.
Generalization in Curriculum Dropout
On image classification, Curriculum Dropout matches or exceeds standard dropout’s generalization across a range of datasets and architectures, including MNIST (MLP), Double-MNIST, CIFAR-10, and Caltech-101 (see Table 1 of (Morerio et al., 2017) for the per-benchmark accuracy gains of standard versus curriculum dropout).
Curriculum Dropout avoids over-regularization at early epochs, resolves the instability seen in abrupt “switch” schedules, and yields lower variance in test accuracy.
Sample Selection Efficiency
Within noisy-label regimes, DropoutTS (exponential ensemble approximation) improves test accuracy over standard Co-teaching+ on MNIST under pairflip noise, and over JoCor on CIFAR-100 under symmetric noise (Lakshya, 2022). Memory and computational overhead are substantially reduced.
Tabu Dropout demonstrates consistent but moderate accuracy improvements over vanilla dropout on MNIST and Fashion-MNIST, with faster convergence.
6. Computational Overhead and Implementation Considerations
DropoutTS variants generally require little to no additional parameterization or major architectural change:
- Sample-adaptive dropout for time series: +4 scalars per dropout layer. Training-time computation is increased (due to FFT and MLP-based noise scoring), but convergence is faster, reducing overall training cost. At inference, all adaptive calculations are dropped and standard fixed dropout rates are applied.
- Curriculum scheduling: No additional cost beyond exponential scheduling of .
- Ensemble selection: Reduces memory (half parameters vs. dual-network methods) and increases per-step speed (one less backward pass).
- Tabu Dropout: No extra parameters; effectively zero overhead per pass.
A plausible implication is that DropoutTS is scalable to large networks and datasets, and can be universally applied to any dropout-capable architecture.
7. Theoretical Foundations and Limitations
The generalization benefits of DropoutTS approaches are underpinned by data-dependent regularization theory: for sample-adaptive dropout, optimal regularization strength increases with per-sample noise (heteroscedastic theory), and for curriculum dropout, annealed regularization aligns with a well-structured curriculum (Morerio et al., 2017, Zhong et al., 29 Jan 2026). Rademacher complexity analysis shows that stronger regularization can tighten generalization bounds on noisy instances.
However, limitations of DropoutTS include additional per-sample computation (notably FFT/IFFT for time series), potential sensitivity to the dropout-rate bounds ($p_{\min}$, $p_{\max}$), and, in the case of Tabu Dropout, short-memory constraints on unit masking.
References:
- Curriculum Dropout (Morerio et al., 2017)
- DropoutTS: Sample-Adaptive Dropout for Robust Time Series Forecasting (Zhong et al., 29 Jan 2026)
- Dropout can Simulate Exponential Number of Models for Sample Selection Techniques (Lakshya, 2022)
- Dropout with Tabu Strategy for Regularizing Deep Neural Networks (Ma et al., 2018)