Early Exit Class Means (E²CM)
- Early Exit Class Means (E²CM) is a method that uses class mean vectors derived from intermediate features to decide when to exit early during inference.
- It computes Euclidean distances between features and class means, applies softmax over negative distances, and uses tunable thresholds to balance accuracy and computational cost.
- Experimental results demonstrate significant FLOPs savings and accuracy trade-offs compared to traditional gradient-based early-exit schemes across various architectures and datasets.
Early Exit Class Means (E²CM) is a neural network early-exit technique that uses class mean vectors derived from feature representations to enable efficient inference and training in both supervised and unsupervised learning contexts. E²CM requires no gradient-based training of internal classifiers and does not modify the architecture or parameters of the base network, which makes it attractive for deployment on low-power or resource-constrained devices. The method computes Euclidean distances between intermediate representations and class means, then applies a softmax-based confidence criterion with tunable thresholds to decide whether to exit early or continue to deeper layers. E²CM has demonstrated favorable trade-offs between accuracy and computational cost compared to traditional early-exit schemes (Görmez et al., 2021).
1. Formal Framework: Class Means and Early Exit Rule
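Given a training dataset $\mathcal{T}$ with class labels and a pretrained $L$-layer network $F$, the notation $F_i(x) = f_i(f_{i-1}(\cdots f_1(x)))$ denotes the output after layer $i$, where $f_i$ is the $i$-th layer and $x$ the input. For each class $k \in \{1, \dots, K\}$, define $S_k \subseteq \mathcal{T}$ as the set of training samples with label $k$ and $|S_k|$ as its size. The class-mean vector at layer $i$ for class $k$ is

$$c_i^k = \frac{1}{|S_k|} \sum_{x \in S_k} F_i(x).$$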
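The class-mean definition above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the function name `class_means` and the toy features are assumptions, and the feature vectors stand in for the outputs of one layer of a real network.

```python
from collections import defaultdict

def class_means(features, labels):
    """Return {class_label: mean feature vector} for a single layer.

    features: list of equal-length lists (layer outputs, one per training sample)
    labels:   list of class labels aligned with `features`
    """
    sums = {}
    counts = defaultdict(int)
    for vec, k in zip(features, labels):
        if k not in sums:
            sums[k] = [0.0] * len(vec)
        for j, value in enumerate(vec):
            sums[k][j] += value
        counts[k] += 1
    # Divide each per-class sum by the class size.
    return {k: [s / counts[k] for s in sums[k]] for k in sums}

# Toy 2-D features for four training samples in two classes (made-up values).
feats = [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0], [12.0, 2.0]]
labels = [0, 0, 1, 1]
means = class_means(feats, labels)
```

Because this is a single pass over the training set with no gradients, it can be repeated independently for every exit-eligible layer.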
At inference, for a sample $x$, for each exit-eligible layer $i$, the distance to each class mean is computed:

$$d_i^k = \lVert F_i(x) - c_i^k \rVert_2, \quad k = 1, \dots, K.$$
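These distances are normalized per class and layer:

$$\tilde{d}_i^k = \frac{d_i^k}{\frac{1}{N} \sum_{n=1}^{N} d_{i,n}^k},$$

where $d_{i,n}^k$ denotes the distance $d_i^k$ for the $n$-th sample during training and $N$ is the number of training samples. Softmax over negative normalized distances yields approximate class probabilities:

$$P_i^k = \frac{\exp(-\tilde{d}_i^k)}{\sum_{j=1}^{K} \exp(-\tilde{d}_i^j)}.$$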
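As a rough illustration of the distance, normalization, and softmax steps, the sketch below computes class probabilities at one layer. It assumes the normalizer for each class is the average training distance to that class mean, passed in as `mean_train_dist`; that reading, the function name `exit_probabilities`, and the toy numbers are assumptions made for illustration.

```python
import math

def exit_probabilities(feature, means, mean_train_dist):
    """Softmax over negative normalized distances to each class mean.

    feature:         layer output for one sample (list of floats)
    means:           list of K class-mean vectors for this layer
    mean_train_dist: list of K positive normalizers (assumed here to be the
                     average training-set distance to each class mean)
    """
    neg = [-math.dist(feature, c) / m for c, m in zip(means, mean_train_dist)]
    top = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in neg]
    total = sum(exps)
    return [e / total for e in exps]

# A sample sitting exactly on the class-0 mean (made-up values).
means = [[1.0, 1.0], [11.0, 1.0]]
probs = exit_probabilities([1.0, 1.0], means, [5.0, 5.0])
```

Here the normalized distances are 0 and 2, so the first class receives the larger probability.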
A layer-specific threshold $T_i$ is employed: if $\max_k P_i^k > T_i$, inference halts and the class prediction is given by $\arg\max_k P_i^k$; otherwise, computation proceeds to the next layer. Thresholds are tuned on the training set to meet FLOPs or accuracy constraints via binary search (Görmez et al., 2021).
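One way to picture the binary-search tuning is the sketch below, which calibrates a single layer's threshold so that at most a target fraction of samples exits there. Using the exit fraction as a stand-in for a FLOPs constraint is an assumption made for illustration, as are the names `exit_fraction` and `calibrate_threshold` and the confidence values; the source does not spell out the exact search procedure.

```python
def exit_fraction(confidences, threshold):
    """Fraction of samples whose top softmax score exceeds the threshold."""
    return sum(c > threshold for c in confidences) / len(confidences)

def calibrate_threshold(confidences, target_fraction, iters=30):
    """Bisection on [0, 1] for the smallest threshold whose exit fraction
    does not exceed `target_fraction`.

    The exit fraction is non-increasing in the threshold, so bisection applies.
    """
    lo, hi = 0.0, 1.0  # lo exits too many samples, hi stays within the target
    for _ in range(iters):
        mid = (lo + hi) / 2
        if exit_fraction(confidences, mid) > target_fraction:
            lo = mid  # too many early exits: raise the threshold
        else:
            hi = mid  # within the target: try a lower threshold
    return hi

# Top-class confidences of ten training samples at one layer (made-up values).
conf = [0.55, 0.62, 0.71, 0.80, 0.93, 0.97, 0.99, 0.60, 0.85, 0.90]
t = calibrate_threshold(conf, target_fraction=0.5)
```

A lower threshold lets more samples exit early and saves more computation, so the search returns the lowest threshold that still respects the target.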
2. Algorithmic Workflow
E²CM operates in two phases: a non-iterative training pass and an inference phase. No internal classifiers or gradient computations are required.
Training phase:
- For each layer $i$:
  - For each training sample $x$, compute $F_i(x)$.
  - For each class $k$, compute and store the class mean vector $c_i^k$.
Inference phase, for a test input $x$:
- For each exit-eligible layer $i$, in order of depth:
  - Compute $F_i(x)$.
  - Compute distances $d_i^k$ for $k = 1, \dots, K$.
  - Normalize distances to obtain $\tilde{d}_i^k$.
  - Compute $P_i^k$ via softmax.
  - If $\max_k P_i^k > T_i$, return predicted class $\arg\max_k P_i^k$.
- If no threshold is exceeded, return the final-layer network prediction.
Thresholds are calibrated to a target FLOPs envelope, e.g., via binary search (Görmez et al., 2021).
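The inference phase described above can be sketched as a loop over layers. This is a minimal illustration under stated assumptions: the layers are plain Python callables, `exits` maps a layer index to hypothetical precomputed class means, normalizers, and a threshold, and the toy network and `final_classifier` are invented for the example.

```python
import math

def softmax_of_negatives(dists):
    """Softmax over -dists, shifted by the smallest distance for stability."""
    base = min(dists)
    exps = [math.exp(base - d) for d in dists]
    total = sum(exps)
    return [e / total for e in exps]

def early_exit_predict(x, layers, exits, final_classifier):
    """Run layers in order and stop at the first exit whose confidence
    exceeds its threshold.

    layers: list of callables applied in sequence
    exits:  {layer_index: (class_means, normalizers, threshold)}
    final_classifier: callable mapping the last features to a class index
    Returns (predicted_class, index of the layer where inference stopped).
    """
    h = x
    for i, layer in enumerate(layers):
        h = layer(h)
        if i in exits:
            means, norms, threshold = exits[i]
            dists = [math.dist(h, c) / n for c, n in zip(means, norms)]
            probs = softmax_of_negatives(dists)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] > threshold:
                return best, i
    return final_classifier(h), len(layers) - 1

# Toy two-layer "network" with one exit after the first layer (all hypothetical).
def double(v):
    return [2 * a for a in v]

def shift(v):
    return [a + 1 for a in v]

def final_classifier(h):
    return 0 if sum(h) < 7 else 1

layers = [double, shift]
exits = {0: ([[0.0, 0.0], [4.0, 4.0]], [1.0, 1.0], 0.9)}

easy = early_exit_predict([2.0, 2.0], layers, exits, final_classifier)
hard = early_exit_predict([1.0, 1.0], layers, exits, final_classifier)
```

In this toy setup the first input lands on a class mean and exits at layer 0, while the second is equidistant from both means and falls through to the final classifier.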
3. Computational and Memory Overhead
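At each potential exit layer $i$, E²CM stores $K$ class-mean vectors, each matching the dimensionality of $F_i(x)$. Writing $D_i$ for that dimensionality, the cumulative extra memory, counted in stored values, is

$$M = K \sum_{i} D_i,$$

where the sum runs over the exit-eligible layers.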
Examples illustrate that, for ResNet-152 on CIFAR-10 ($K = 10$), the total overhead is a small fraction of the base model size. For $K = 100$ (CIFAR-100), the overhead rises linearly with the number of classes. Dimensionality reduction such as max-pooling can mitigate this with negligible impact on separability.
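The linear growth with the number of classes, and the effect of pooling, can be checked with a small calculation. The feature shapes below are hypothetical and not taken from the paper, the helper names are invented, and 4-byte floats are assumed.

```python
def class_mean_memory_bytes(num_classes, exit_dims, bytes_per_value=4):
    """Bytes needed to store one mean vector per class at every exit layer."""
    return num_classes * sum(exit_dims) * bytes_per_value

def pooled_dim(channels, height, width, pool):
    """Flattened feature size after non-overlapping pool x pool max-pooling."""
    return channels * (height // pool) * (width // pool)

# Hypothetical exit-layer feature shapes as (channels, height, width).
shapes = [(16, 32, 32), (32, 16, 16), (64, 8, 8)]
raw = [c * h * w for c, h, w in shapes]
pooled = [pooled_dim(c, h, w, pool=4) for c, h, w in shapes]

mem_10 = class_mean_memory_bytes(10, raw)             # 10 classes
mem_100 = class_mean_memory_bytes(100, raw)           # 100 classes
mem_100_pooled = class_mean_memory_bytes(100, pooled)  # 100 classes, pooled
```

With these made-up shapes, going from 10 to 100 classes multiplies the storage by 10, and 4 x 4 pooling divides it by 16.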
In terms of computational demands, each early-exit layer adds one vector subtraction, one norm, and one normalization per class, plus one softmax, per inference. Overheads are small: for example, in ResNet-152 or WideResNet-101 on CIFAR-10, this is about $0.007$ normalized extra FLOPs per layer; for CIFAR-100, $0.057$; and with max-pooling on ImageNet, about $0.001$.
Relative to gradient-based internal classifiers (ICs), which require learning additional parameters and entail larger test-time FLOPs and memory, E²CM's overhead is minimal (Görmez et al., 2021).
4. Experimental Findings and Trade-Offs
Experiments evaluated E²CM against early-exit schemes including Shallow-Deep Networks (SDN), BranchyNet, and BWDS, across architectures including MobileNetV3, EfficientNet, and ResNet, and on datasets such as CIFAR-10, CIFAR-100, ImageNet, and KMNIST.
Key regimes:
- One-epoch training (supervised): E²CM achieves higher accuracy at equivalent FLOPs, or fewer FLOPs at comparable accuracy, across all reported models and datasets.
- Limited multi-epoch training (e.g., $6$ epochs, with BWDS): combining E²CM with SDN, e.g., on ResNet-152/CIFAR-100 at a fixed FLOPs budget, yielded higher early-exit accuracy than SDN, BWDS, or BranchyNet.
- Fine-tuning (ImageNet, TinyImageNet): EfficientNet-B0 and MobileNetV3Large showed higher accuracy at lower FLOPs compared to SDN and BranchyNet, with class-mean memory kept small on both datasets after pooling.
- Unlimited training (E²CM+ICs): at low FLOPs budgets (a small fraction of the full model), E²CM+SDN improved accuracy over SDN alone; E²CM+SDN was likewise compared against SDN at moderate FLOPs budgets (Görmez et al., 2021).
5. Unsupervised Learning Extension
E²CM extends directly to unsupervised clustering. Incorporation within Deep Embedding Clustering (DEC) is achieved by (a) training multiple DEC networks with varying encoder depths, (b) extracting, copying, and freezing early encoder layers, each with an attached clustering layer (CL), and (c) treating the DEC cluster centres as class means. The same distance, softmax, and thresholding exit rule is applied at each CL. On MNIST and Fashion-MNIST, E²CM yielded FLOPs savings in exchange for some loss in clustering accuracy.
6. Distinguishing Characteristics and Applicability
E²CM is distinguished by its plug-and-play nature: no gradient-based training or internal classifier design is necessary, nor is any modification of the base network required. The method's storage and compute overheads remain modest, particularly with pooling, and it provides practical FLOPs/accuracy trade-offs, especially under tight training constraints or in edge-device scenarios. E²CM achieves notable gains over gradient-based early-exit methods in both supervised and unsupervised paradigms, and can further enhance existing early-exit schemes by direct combination (Görmez et al., 2021).
A plausible implication is that E²CM is especially suitable for applications with limited training compute or storage, or where model modifications are infeasible, such as wireless edge networks.