Split-MNIST Task: A Continual Learning Benchmark
- Split-MNIST is a benchmark that partitions the MNIST dataset into sequential binary tasks, clearly defining incremental challenges and catastrophic forgetting evaluation.
- Research on Split-MNIST employs methods such as generative replay, functional regularization, and specialized neural architectures to maintain high per-task accuracy and mitigate forgetting.
- Related work extends to privacy-preserving learning and hardware optimization, targeting robust knowledge transfer and error resilience in real-world incremental settings.
The Split-MNIST task is a canonical benchmark used in continual and incremental learning, where the MNIST dataset is partitioned into a sequence of subtasks. Each subtask typically consists of classifying a disjoint subset of MNIST digit classes (for example, [0, 1], [2, 3], ..., [8, 9]), enabling systematic evaluation of catastrophic forgetting and knowledge transfer as new classes are introduced sequentially. This scenario involves training a model on each split (subtask) sequentially, without revisiting previous data, and measuring the model's ability to retain prior knowledge while adapting to new classes. The Split-MNIST protocol has become foundational for research in lifelong learning, continual learning algorithms, and privacy-preserving collaborative learning.
1. Task-Based Splitting and Its Rationale
The Split-MNIST protocol partitions the ten MNIST digit classes into several (often five) binary classification tasks, e.g., Task 1: {0, 1}, Task 2: {2, 3}, ..., Task 5: {8, 9}. For each sequential task, the learner is exposed only to data from the current subtask; previous task data is not reintroduced. This arrangement allows researchers to systematically evaluate the phenomenon of catastrophic forgetting—where models lose performance on earlier tasks when optimized on new ones. The methodology generalizes to larger, more diverse datasets (e.g., EMNIST By_Class or By_Merge (Cohen et al., 2017)), where tasks may involve mixed digits and letters or fine-grained subset splits, increasing challenge and diversity.
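As a concrete illustration, the five-way binary split described above can be built from an array of labels in a few lines of NumPy. This is a minimal sketch; `make_split_tasks` and its arguments are illustrative names, not part of any benchmark library.

```python
import numpy as np

def make_split_tasks(labels, classes_per_task=2):
    """Partition example indices into sequential Split-MNIST tasks.

    labels: 1-D integer array of class labels (0-9 for MNIST).
    Returns a list of (task_classes, example_indices) pairs,
    one per task, in the order the tasks are presented.
    """
    classes = np.unique(labels)
    tasks = []
    for start in range(0, len(classes), classes_per_task):
        task_classes = classes[start:start + classes_per_task]
        idx = np.where(np.isin(labels, task_classes))[0]
        tasks.append((tuple(int(c) for c in task_classes), idx))
    return tasks

# Toy labels standing in for MNIST targets.
labels = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1])
tasks = make_split_tasks(labels)
# Five tasks: {0,1}, {2,3}, {4,5}, {6,7}, {8,9}.
```

During training, the learner would iterate over `tasks` in order, seeing only the indices of the current task, which is exactly the no-revisiting protocol described above.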
2. Dataset Transformation and Compatibility
Split-MNIST tasks are created by emulating task-based continual learning on top of standard digit datasets. The EMNIST dataset (Cohen et al., 2017) provides compatible variants for such task-based splits:
- Image Dimensions and Structure: All EMNIST variants maintain the 28×28 grayscale image format, ensuring drop-in compatibility with MNIST-optimized architectures.
- Conversion Process: Images are processed from the raw NIST 128×128 binary format using a standardized pipeline: Gaussian filtering, bounding box extraction, centering with padding, and bi-cubic resizing. The Gaussian filter is

  $$G(x, y) = \frac{1}{2\pi\sigma^{2}} \exp\!\left(-\frac{x^{2} + y^{2}}{2\sigma^{2}}\right),$$

  where $\sigma = 1$.
- Task Segmentation: With hundreds of thousands of images and up to 62 classes, EMNIST allows for creation of Split-MNIST analogs using letters, merged classes, or balanced subsets. This increases challenge and serves as a more realistic continual learning testbed.
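The Gaussian-filtering step of the conversion pipeline above can be sketched in plain NumPy. The kernel size and helper names below are illustrative choices, not the exact EMNIST implementation; only σ = 1 follows the conversion description.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """2-D Gaussian kernel G(x, y) ∝ exp(-(x² + y²) / (2σ²))."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)
    return k / k.sum()  # normalize so blurring preserves total intensity

def blur(image, kernel):
    """Valid-mode 2-D convolution; fine for the large 128x128 source scans."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

img = np.zeros((16, 16))
img[8, 8] = 1.0                       # a single "on" pixel
smooth = blur(img, gaussian_kernel())  # the spike is spread into a soft blob
```

The subsequent bounding-box, padding, and bi-cubic resize steps would then operate on the smoothed image.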
3. Catastrophic Forgetting and Continual Learning Solutions
The primary research focus in Split-MNIST is catastrophic forgetting, where learning new tasks degrades model performance on prior ones. Approaches to mitigate forgetting include:
- Generative Replay and Functional Regularization: In hybrid probabilistic models (Kirichenko et al., 2021), each class and task distribution is modeled via a normalizing flow (e.g., RealNVP, Glow). After each split, a snapshot of the flow is used to replay older data (generative replay), or regularization is applied to maintain the previous functional mappings (functional regularization):

  $$\mathcal{L}_{\text{FR}}(\theta) = \mathbb{E}_{x' \sim p_{\hat{\theta}}}\big[\|f_{\theta}(x') - f_{\hat{\theta}}(x')\|^{2}\big],$$

  where $x'$ are samples generated from snapshot parameters $\hat{\theta}$ and $f_{\theta}$ is the flow's invertible mapping.
Experimental results demonstrate that HCL achieves average per-task accuracy of about 97.17% and low average forgetting (1.53%) on Split-MNIST.
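The replay mechanism can be sketched with a toy stand-in for the flow: after each task the model is snapshotted, and the snapshot generates pseudo-samples of earlier classes that are mixed into the next task's training data. Everything below (the per-class Gaussian "generator", the nearest-centroid classifier) is a deliberately simplified placeholder for the actual class-conditional normalizing flow.

```python
import numpy as np

rng = np.random.default_rng(0)

class ToyGenerativeModel:
    """Per-class Gaussian: a stand-in for a class-conditional normalizing flow."""
    def __init__(self):
        self.means = {}  # class label -> feature mean

    def fit_task(self, X, y):
        for c in np.unique(y):
            self.means[int(c)] = X[y == c].mean(axis=0)

    def sample(self, n_per_class):
        """Generative replay: draw pseudo-data from the stored class models."""
        Xs, ys = [], []
        for c, mu in self.means.items():
            Xs.append(mu + 0.1 * rng.standard_normal((n_per_class, mu.size)))
            ys.append(np.full(n_per_class, c))
        return np.vstack(Xs), np.concatenate(ys)

    def predict(self, X):
        labels = sorted(self.means)
        d = np.stack([np.linalg.norm(X - self.means[c], axis=1) for c in labels])
        return np.array(labels)[d.argmin(axis=0)]

# Task 1: classes 0/1.
X1 = np.vstack([np.zeros((20, 2)), np.ones((20, 2))])
y1 = np.array([0] * 20 + [1] * 20)
model = ToyGenerativeModel()
model.fit_task(X1, y1)

# Snapshot, then train task 2 (classes 2/3) on new data plus replayed 0/1 data.
snapshot = ToyGenerativeModel()
snapshot.means = dict(model.means)
Xr, yr = snapshot.sample(20)
X2 = np.vstack([np.full((20, 2), 3.0), np.full((20, 2), 5.0)])
y2 = np.array([2] * 20 + [3] * 20)
model.fit_task(np.vstack([Xr, X2]), np.concatenate([yr, y2]))
```

Because replayed samples keep classes 0 and 1 represented during task 2, the model retains its task-1 decision regions instead of forgetting them.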
- Subspace Feature Extraction: Multi-Subspace Neural Networks (MSNN) (Fang et al., 2020) utilize parallel blocks, each leveraging basis vectors learned via adaptive subspace self-organizing maps (ASSOM), enabling robust extraction of invariant features across splits. MSNN achieved a test error of 0.95% on MNIST and demonstrated resilience to noise disturbance—a property valuable in Split-MNIST scenarios, especially when each task covers different digits and intra-class variation.
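The subspace idea can be illustrated without the full ASSOM training loop: each parallel module holds an orthonormal basis, and an input's invariant feature is the energy of its projection onto that subspace, with the best-matching module winning. The names and the tiny 1-D subspaces below are illustrative, not taken from the MSNN code.

```python
import numpy as np

def subspace_response(x, basis):
    """Energy of x's projection onto the subspace spanned by the
    orthonormal rows of `basis`.

    Invariance: inputs differing only by components orthogonal to the
    subspace produce identical responses.
    """
    coeffs = basis @ x            # projection coefficients
    return float(np.sum(coeffs ** 2))

# Two competing 1-D subspaces in R^3 (stand-ins for ASSOM-learned bases).
basis_a = np.array([[1.0, 0.0, 0.0]])
basis_b = np.array([[0.0, 1.0, 0.0]])

x = np.array([0.9, 0.1, 0.5])
responses = [subspace_response(x, b) for b in (basis_a, basis_b)]
winner = int(np.argmax(responses))   # module whose subspace best matches x
```

In MSNN, many such modules run in parallel, and their responses form a feature vector that is robust to within-class perturbations lying outside the learned subspaces.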
- Hopfield-CNN Hybrid Architectures: Models integrating deep CNN feature extraction with multi-well Hopfield networks (Farooq, 11 Jul 2025) use k-means clustering to create multiple class prototypes (or wells) per digit. The Hopfield network minimizes a multi-well energy function of the form

  $$E(x) = -\sum_{c}\sum_{k} \exp\!\left(-\frac{\|x - \mu_{c,k}\|^{2}}{2\sigma^{2}}\right),$$

  where $\mu_{c,k}$ is the $k$-th prototype (well) of class $c$, yielding robust classification even with significant intra-class variability. High accuracy (up to 99.44%) has been reported by optimizing the number of wells and CNN layers.
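A minimal sketch of multi-well classification: a handful of prototypes per class define the wells, and a test point is assigned to the class whose wells give the lowest total energy. The Gaussian-well energy used here is one plausible form consistent with the description above; the paper's exact energy and update dynamics may differ.

```python
import numpy as np

def energy(x, prototypes, sigma=1.0):
    """Multi-well energy: one Gaussian well per prototype; lower is better."""
    d2 = np.sum((prototypes - x) ** 2, axis=1)
    return float(-np.sum(np.exp(-d2 / (2 * sigma ** 2))))

def classify(x, wells_by_class, sigma=1.0):
    """Assign x to the class whose wells yield the lowest energy."""
    energies = {c: energy(x, p, sigma) for c, p in wells_by_class.items()}
    return min(energies, key=energies.get)

# Two prototypes ("wells") per class capture intra-class variability,
# e.g. two writing styles of the same digit (toy 2-D features).
wells = {
    0: np.array([[0.0, 0.0], [0.5, 0.5]]),
    1: np.array([[3.0, 3.0], [3.5, 2.5]]),
}
pred = classify(np.array([0.4, 0.4]), wells)   # lands near the class-0 wells
```

In the full hybrid model, `x` would be a CNN feature vector and the wells would come from k-means over each class's training features.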
4. Privacy and Collaborative Learning in Split Scenarios
Certain implementations of Split-MNIST are used as benchmarks for privacy-preserving federated learning protocols. In label-private split learning (Jiang et al., 11 Oct 2024), two parties (image and label owners) jointly train a model without either side having access to the full raw data:
- Secure Dimension Transformation (SecDT): Labels are expanded and obfuscated in a high-dimensional $K$-class space via randomized one-hot mapping, e.g.,

  $$y \mapsto \tilde{y} = \mathbf{e}_{\pi(y)} \in \{0, 1\}^{K},$$

  where $\pi$ secretly assigns the $C$ original classes to slots in the larger space ($K > C$), with a subsequent weighted mapping back to the original label space for inference.
- Gradient Normalization: To prevent norm-based label inference attacks, incoming gradients are normalized across batch elements,

  $$\tilde{g}_i = \frac{g_i}{\|g_i\|_2},$$

  so that per-example gradient magnitudes no longer correlate with the underlying labels.
- Noise-Based Obfuscation: Softmax-normalized Gaussian noise is added to the transformed labels, making $K$ unpredictable to adversaries.
Experimental evaluation demonstrates a substantial reduction in label leakage (lower Attack AUC) without compromising classifier accuracy on MNIST.
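Two of the SecDT ingredients, the randomized label expansion and the gradient normalization, can be sketched as follows. This is a simplified illustration under stated assumptions: the real scheme's mapping may spread a class over several weighted slots, and all function names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

def expand_labels(y, n_classes, K):
    """Randomized one-hot mapping into a larger K-class space (K > n_classes).

    Each original class is secretly assigned one of K slots; only the
    expanded one-hot vectors leave the label owner's side.
    """
    assert K > n_classes
    slots = rng.choice(K, size=n_classes, replace=False)  # secret class->slot map
    onehot = np.zeros((len(y), K))
    onehot[np.arange(len(y)), slots[y]] = 1.0
    return onehot, slots

def recover_labels(expanded, slots):
    """Inverse mapping back to the original label space for inference."""
    slot_to_class = {int(s): c for c, s in enumerate(slots)}
    return np.array([slot_to_class[int(k)] for k in expanded.argmax(axis=1)])

def normalize_gradients(grads, eps=1e-12):
    """Rescale each per-example gradient to unit norm so its magnitude
    leaks nothing about the underlying label."""
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    return grads / (norms + eps)

y = np.array([0, 1, 2, 1, 0])
expanded, slots = expand_labels(y, n_classes=3, K=10)
g_norm = normalize_gradients(rng.standard_normal((5, 4)))
```

Only the party holding `slots` can invert the expansion, which is what keeps the true labels hidden from the image owner during split training.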
5. Hardware Implementations and Edge Learning
Split-MNIST serves as a relevant benchmark for hardware-efficient continual inference schemes. SNN accelerators on FPGA (Caviglia et al., 4 Jul 2025) employ:
- Spike Encoding: Poisson rate coding translates image pixel intensities $x_{ij} \in [0, x_{\max}]$ to spike generation probabilities:

  $$P\big(s_{ij}[t] = 1\big) = \frac{x_{ij}}{x_{\max}}.$$
- Leaky Integrate-and-Fire in Hardware: Membrane voltages update via power-of-two leak approximations, enabling efficient bit-shift operations:

  $$V[t+1] = V[t] - \frac{V[t]}{2^{k}} + I[t],$$

  where the division by $2^{k}$ is realized as a right shift.
- Quantization: Model parameters (weight bit-width, membrane-potential bit-width, and fractional precision) are selected to balance accuracy and resource efficiency on FPGA. For example, a 10/10/6 bit configuration delivered up to 97.86% accuracy.
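The encoding and neuron-update rules above can be sketched in a few lines of integer-friendly NumPy. The shift amount `k`, the threshold, and the timestep count below are illustrative choices, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(1)

def poisson_encode(pixels, timesteps, x_max=255):
    """Poisson rate coding: pixel intensity -> per-timestep spike probability."""
    p = pixels / x_max
    return (rng.random((timesteps,) + pixels.shape) < p).astype(np.uint8)

def lif_step(v, current, k=4, threshold=64):
    """One LIF update with a power-of-two leak: v -= v >> k (a bit shift),
    then integrate the input and fire/reset on threshold crossing."""
    v = v - (v >> k) + current
    spikes = v >= threshold
    v = np.where(spikes, 0, v)       # reset membrane after a spike
    return v, spikes.astype(np.uint8)

pixels = np.array([0, 128, 255])
spike_train = poisson_encode(pixels, timesteps=100)
# A dark pixel (0) never spikes; a saturated pixel (255) spikes every step.

v = np.zeros(3, dtype=np.int32)
v, s = lif_step(v, current=np.array([10, 40, 70], dtype=np.int32))
```

Because the leak is a shift and the state is integer-valued, the whole update maps directly onto FPGA adders and shifters with no multipliers.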
While not explicitly addressing continual learning or Split-MNIST, these accelerators provide a template for future temporal task-driven inference pipelines where memory, energy, and latency constraints are critical.
6. Comparative Performance and Methodological Considerations
Benchmarking on Split-MNIST reveals:
| Method/Variant | Average Accuracy (%) | Forgetting (%) | Notes |
|---|---|---|---|
| HCL-FR (normalizing flow, FR) (Kirichenko et al., 2021) | 97.17 | 1.53 | Outperforms Adam and VAE baselines; robust in task-agnostic setting |
| OPIUM-ELM (digits only) (Cohen et al., 2017) | ~97.5 | -- | Slightly higher than original MNIST due to better utilization of the 28×28 frame |
| OPIUM-ELM (digits + letters) (Cohen et al., 2017) | ~77–80 | -- | Complexity and ambiguity from letters lower accuracy |
| MSNN (Fang et al., 2020) | 99.05 (error: 0.95) | -- | Robust to noise, strong invariant features |
| Hopfield-CNN hybrid (Farooq, 11 Jul 2025) | 99.44 | -- | Handles intra-class variability via multi-well attractors |
These results emphasize the effectiveness of generative replay/regularization and robust feature extraction strategies. Inclusion of additional letter classes or adversarial task splits sharply increases complexity, requiring specialized models for high retention and generalization.
7. Future Directions in Split-MNIST Research
Ongoing and anticipated research lines include:
- Refinement of Subspace and Energy-Based Models: Enhancing kernel quality, adapting the number of parallel subspaces, and improving prototype separation (multi-well landscapes) for larger scales and more intricate splits.
- Integration with Incremental and Privacy-Preserving Frameworks: Combining continual learning solutions with advanced privacy defenses (as in SecDT (Jiang et al., 11 Oct 2024)) and federated pipelines.
- Hardware-Aware Optimization: Adapting surrogate-gradient SNN training, quantization, and fixed-point arithmetic for real-time continual inference, emphasizing edge deployment criteria.
- Handling Realistic Task Sequences: Moving from synthetic, balanced splits to unbalanced, overlapping, or reoccurring tasks, especially as enabled by expanded datasets like EMNIST (Cohen et al., 2017).
- Quantitative Analysis of Class Variability: Systematic study of intra-class variability, ambiguity-induced error rates, and their interplay with forgetting in incremental learning contexts.
A plausible implication is that advances in Split-MNIST methodologies continue to catalyze progress in robust continual learning, privacy assurance, and hardware deployments for pattern recognition. These advances shape standards for evaluation, model comparison, and real-world task deployment in non-stationary learning environments.