Prototype-based Dark Experience Replay

Updated 14 November 2025
  • The paper introduces a novel continual learning framework for smart grid fault prediction combining prototype-based regularization with dark experience distillation.
  • It integrates multiple loss components and prototype-aware memory updates to mitigate catastrophic forgetting and maintain discriminative feature clusters.
  • Experimental results show that ProDER outperforms ER and DER++, substantially improving accuracy while keeping a small, fixed memory footprint across diverse smart grid scenarios.

Prototype-based Dark Experience Replay (ProDER) is a continual learning (CL) framework introduced to address the challenges of fault prediction in evolving smart grid environments. By unifying prototype-based feature regularization, dark-experience (logit) distillation, and semantically informed memory management, ProDER improves knowledge retention and adaptability to new classes or domains without requiring task-specific heads or unbounded memory growth. Building on the DER++ baseline, it incorporates prototype-attraction/repulsion and prototype-aware sample replay to maintain both compact intra-class clustering and strong inter-class discrimination over time.

1. Problem Setting and Motivation

Smart grid fault prediction under continual learning presents two key obstacles: catastrophic forgetting—where model performance degrades on previously learned grid conditions as new fault types or grid zones are introduced—and representation drift, which erodes the separability and stability of internal feature representations. ProDER is proposed with the aim to:

  • Retain predictive accuracy on old fault-type and fault-zone classes.
  • Adapt efficiently to new classes and evolving operational zones.
  • Maintain well-separated and compact feature clusters for each class in the learned embedding space.

This is achieved using a single evolving model and a fixed-size memory buffer, making the approach amenable to practical constraints of edge and substation deployment in smart grids (Efatinasab et al., 7 Nov 2025).

2. Algorithmic Structure

The ProDER framework extends DER++ by adding prototype-based regularization and prototype-guided memory organization. The core algorithm, executed once per continual learning task (task boundaries are assumed to be known), consists of the following steps:

  1. Prototype Computation: For each class $c$ seen so far, recompute the prototype $p_c$ as the mean of embeddings of examples labeled $c$ from both the current task and the replay memory:

$$p_c = \frac{1}{|\mathcal{D}_c|}\sum_{(x_i,\, y_i=c)}\varphi_\theta(x_i)$$

where $\varphi_\theta$ denotes the feature extractor and $\mathcal{D}_c$ is the set of these examples.

  2. Joint Training with Multiple Loss Terms: During each epoch, input minibatches comprise both new samples and replayed memory items. Four loss components are combined:
    • Cross-entropy on new-task examples,
    • KL-divergence for dark-experience (logit) distillation on memory samples,
    • Prototype attraction (pulls features toward the class prototype),
    • Prototype repulsion (pushes class prototypes apart).
  3. Prototype-aware Memory Update: After each task, for each class, select both core and boundary examples: a fraction $\rho$ of samples closest to the prototype (core/typical) and the remaining $1-\rho$ fraction of the most distant (boundary/atypical) to populate the memory buffer, replacing the oldest if necessary to maintain fixed capacity.

The following pseudocode organizes these steps precisely:

initialize model parameters θ
initialize empty replay memory 𝓜 ← ∅
for t = 1 … T do
    receive new-task data 𝒟ₜ = {(xᵢ, yᵢ)}
    for each class c:
        P[c] ← (1 / |𝒟_c|) Σ_{(xᵢ,yᵢ) ∈ 𝒟ₜ ∪ 𝓜, yᵢ=c} φ_θ(xᵢ)
    for epoch = 1 … E do
        for minibatch B from 𝒟ₜ ∪ sampled(𝓜) do
            # forward pass and loss computation
            # (see Section 3 for exact formulation)
            backpropagate ∇_θ L_total, update θ
    for each class c:
        candidates = {(xᵢ, yᵢ, zᵢ, fᵢ = φ_θ(xᵢ)) : yᵢ = c}
        select ρ·K core + (1−ρ)·K boundary samples
        add these K samples to 𝓜 (evict oldest if needed)
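
To make steps 1 and 3 concrete, the prototype recomputation and the prototype-aware memory selection could be sketched in PyTorch as follows. This is a minimal illustration, not the authors' code: the function names, the per-class budget K, and the exact tensor shapes are assumptions.

import torch

def compute_prototypes(phi, examples):
    """Step 1 (sketch): one prototype per class, as the mean embedding over all
    examples of that class from the current task and the replay memory."""
    feats = {}
    with torch.no_grad():
        for x, y in examples:                         # examples: iterable of (window, label)
            z = phi(x.unsqueeze(0)).squeeze(0)        # embedding of one sample
            feats.setdefault(int(y), []).append(z)
    return {c: torch.stack(zs).mean(dim=0) for c, zs in feats.items()}

def select_for_memory(embeddings, prototype, K, rho=0.5):
    """Step 3 (sketch): indices of rho*K core samples (closest to the prototype)
    and (1 - rho)*K boundary samples (farthest from it) for one class."""
    dists = torch.norm(embeddings - prototype, dim=1)
    order = torch.argsort(dists)                      # closest first
    k = min(K, len(order))
    n_core = int(round(rho * k))
    core = order[:n_core]
    boundary = order[len(order) - (k - n_core):] if k > n_core else order[:0]
    return torch.cat([core, boundary])

In the full algorithm, the selected indices would be used to copy $(x, y, z, \varphi_\theta(x))$ tuples into the buffer 𝓜, evicting the oldest entries once the fixed capacity is reached.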

3. Objective Function and Loss Components

ProDER combines four loss terms, integrated into a single optimization objective for each minibatch:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{CE}} + \alpha\,\mathcal{L}_{\mathrm{distill}} + \beta\,\mathcal{L}_{\mathrm{proto}} + \gamma\,\mathcal{L}_{\mathrm{repel}}$$

where:

  • $\mathcal{L}_{\mathrm{CE}}$: Cross-entropy on the new-task samples,

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{|B_{\mathrm{new}}|}\sum_{i\in B_{\mathrm{new}}}\sum_{c} \mathbf{1}\{y_i=c\}\log\left(\mathrm{softmax}(z_i)_c\right)$$

  • $\mathcal{L}_{\mathrm{distill}}$: KL divergence between current and stored logits (soft targets) for replayed samples,

$$\mathcal{L}_{\mathrm{distill}} = \mathrm{KL}\left(\mathrm{softmax}(z_i/T)\ \|\ \mathrm{softmax}(z_i^{\mathrm{old}}/T)\right)$$

  • $\mathcal{L}_{\mathrm{proto}}$: Mean squared distance of each feature to its class prototype,

$$\mathcal{L}_{\mathrm{proto}} = \frac{1}{|B|}\sum_{i\in B}\|\varphi_\theta(x_i) - p_{y_i}\|_2^2$$

  • $\mathcal{L}_{\mathrm{repel}}$: Repulsion term encouraging separation of class prototypes,

$$\mathcal{L}_{\mathrm{repel}} = \frac{1}{C(C-1)}\sum_{i=1}^C\sum_{j\neq i}\exp\left(-\|p_i-p_j\|\right)$$

Hyperparameters $(\alpha, \beta, \gamma)$ are scenario-specific; e.g., $(2, 5, 0.5)$ and $(2, 7.2, 0.5)$ in different class- and domain-incremental tasks.
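
A minimal PyTorch sketch of this combined objective is given below. It assumes the memory logits $z^{\mathrm{old}}$ were stored at insertion time and that prototypes is a $[C, d]$ tensor of current class prototypes; the temperature T, tensor names, and default weights are illustrative, and the paper's exact implementation may differ.

import torch
import torch.nn.functional as F

def proder_loss(logits_new, y_new, feats, y_all, prototypes,
                logits_mem, logits_mem_old, alpha=2.0, beta=5.0, gamma=0.5, T=2.0):
    # cross-entropy on new-task samples
    l_ce = F.cross_entropy(logits_new, y_new)

    # dark-experience distillation on memory samples:
    # KL( softmax(z/T) || softmax(z_old/T) ), as in the formula above
    l_distill = F.kl_div(F.log_softmax(logits_mem_old / T, dim=1),
                         F.softmax(logits_mem / T, dim=1),
                         reduction="batchmean")

    # prototype attraction: mean squared distance of each feature to its class prototype
    l_proto = ((feats - prototypes[y_all]) ** 2).sum(dim=1).mean()

    # prototype repulsion: average exp(-distance) over all ordered pairs of prototypes
    d = torch.cdist(prototypes, prototypes)
    off_diag = ~torch.eye(len(prototypes), dtype=torch.bool, device=d.device)
    l_repel = torch.exp(-d[off_diag]).mean()

    return l_ce + alpha * l_distill + beta * l_proto + gamma * l_repel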

4. Feature Extractor, Memory Buffer, and Architecture

ProDER uses a unified recurrent neural architecture suitable for multivariate sensor time series:

  • Input: $W$ time steps $\times$ $F = 51$ features.
  • Feature extractor $\varphi_\theta$: Bidirectional GRU with 150 hidden units per direction ($300$-dim embedding per time step), followed by $0.3$ dropout.
  • Classifier head $h$: Dynamic fully connected layer $\mathbb{R}^{300}\to\mathbb{R}^{C}$, where $C$ expands as classes are introduced; prior class weights remain unchanged.
  • Replay Memory $\mathcal{M}$: Stores tuples $(x, y, z, \varphi_\theta(x))$. Memory is capped (e.g., $M = 363$ samples, $\approx 23.5\%$ of the dataset) and updated using a prototype-biased strategy to maintain both typical and boundary samples for each class.

This architecture avoids task-specific heads and maintains a fixed computational and memory profile, favoring scalability.
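
A minimal PyTorch sketch of this backbone is shown below; the class name, the use of the last time step as the sequence embedding, and the head-expansion helper are assumptions made for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class ProDERNet(nn.Module):
    """Sketch: BiGRU feature extractor plus a dynamically growing linear head."""
    def __init__(self, n_features=51, hidden=150, n_classes=3, dropout=0.3):
        super().__init__()
        self.gru = nn.GRU(input_size=n_features, hidden_size=hidden,
                          batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(dropout)
        self.head = nn.Linear(2 * hidden, n_classes)   # 300-dim embedding -> C logits

    def features(self, x):                  # x: (batch, W, 51)
        out, _ = self.gru(x)                # (batch, W, 300)
        return self.dropout(out[:, -1, :])  # last time step as embedding (assumption)

    def forward(self, x):
        return self.head(self.features(x))

    def expand_head(self, n_classes_total):
        """Grow the head to n_classes_total outputs, preserving prior class weights."""
        old = self.head
        new = nn.Linear(old.in_features, n_classes_total)
        with torch.no_grad():
            new.weight[:old.out_features] = old.weight
            new.bias[:old.out_features] = old.bias
        self.head = new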

5. Continual Learning Scenarios and Evaluation Protocol

ProDER was evaluated under four distinct, realistic continual learning scenarios for evolving smart grids:

| Scenario | Task Sequence | Description |
| --- | --- | --- |
| 1 | +2 fault types per task (tasks 1-5) | 3 classes initially, 2 new per task, total 11 |
| 2 | +1 fault type per task (tasks 1-9) | 3 classes initially, 1 new per task, total 11 |
| 3 | Domain-incremental (zones) | All 11 faults, new input distribution per task (new grid zone each task) |
| 4 | +1 grid-zone class per task (tasks 1-3) | 2 zones initially, add one per task, total 4 zone classes |

Performance is reported as final accuracy (ACC) on all seen classes after training on the last task, together with the gap to a Joint-Training upper bound (a single model trained offline with simultaneous access to all data).
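
For concreteness, the reported gap is simply the difference between the joint-training accuracy and a method's final accuracy; a trivial sketch with illustrative values from Section 6:

def joint_gap(final_acc: float, joint_acc: float) -> float:
    """Gap to the joint-training upper bound for one scenario."""
    return joint_acc - final_acc

# example: ProDER in scenario 1 (values from the results table in Section 6)
print(round(joint_gap(0.613, 0.658), 3))   # 0.045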

6. Experimental Results and Comparative Performance

ProDER consistently outperforms both experience replay (ER) and the dark experience replay variant DER++, reducing the gap to the joint-training upper bound by roughly 40-64% relative to DER++ across all scenarios. Notable results include:

| Scenario | Joint ACC / gap | ER ACC / gap | DER++ ACC / gap | ProDER ACC / gap |
| --- | --- | --- | --- | --- |
| 1 (faults +2) | 0.658 / 0.000 | 0.459 / 0.199 | 0.558 / 0.100 | 0.613 / 0.045 |
| 2 (faults +1) | 0.658 / 0.000 | 0.462 / 0.196 | 0.519 / 0.139 | 0.576 / 0.082 |
| 3 (zones) | 0.658 / 0.000 | 0.524 / 0.134 | 0.524 / 0.134 | 0.610 / 0.048 |
| 4 (zone-incr.) | 0.981 / 0.000 | 0.917 / 0.064 | 0.956 / 0.025 | 0.966 / 0.015 |

A steady memory footprint ($\leq 1$ MB) and independence from task-specific heads establish ProDER's scalability. The results demonstrate that the addition of prototype objectives yields a more robust feature-space structure and improved knowledge retention.

7. Implications, Limitations, and Future Directions

Implications:

ProDER offers a practical framework for continual smart-grid fault prediction deployable in real-world edge scenarios. Prototype-based regularization structures class-wise representations, supporting fault-class separability and potentially improving explainability and anomaly detection in grid monitoring.

Limitations:

The approach presumes known task boundaries, which may not always be available in live operational settings. Replay memory requires storing authentic sensor windows, possibly conflicting with regulatory constraints.

Future Directions:

ProDER could be extended to handle unknown or fuzzy task transitions, to incorporate additional metadata (e.g., fault severity), or to support mixed event types in real-time deployments. Evaluating adaptation to blurred or undetected task changes remains an open research challenge (Efatinasab et al., 7 Nov 2025).

