Early Exit Class Means (E²CM)

Updated 31 January 2026

Early Exit Class Means (E²CM) is a method that uses class mean vectors derived from intermediate features to decide when to exit early during inference.
It computes Euclidean distances between features and class means, applies softmax over negative distances, and uses tunable thresholds to balance accuracy and computational cost.
Reported experiments show FLOPs savings and favorable accuracy trade-offs relative to traditional gradient-based early-exit schemes across several architectures and datasets.

Early Exit Class Means (E$^2$CM) is a neural network early exit technique that uses class mean vectors derived from feature representations to enable efficient inference and training in both supervised and unsupervised learning contexts. E$^2$CM operates without gradient-based training of internal classifiers and does not modify the architecture or parameters of the base network, providing an attractive solution for deployment on low-power or resource-constrained devices. The method leverages Euclidean distances between intermediate representations and class means, employing a softmax-based confidence criterion and tunable thresholds to determine whether to exit early or continue computation to deeper layers. E$^2$CM has demonstrated favorable trade-offs in accuracy and computational cost compared to traditional early exit schemes (Görmez et al., 2021).

1. Formal Framework: Class Means and Early Exit Rule

Given a training dataset $D = \{(x_0^{(i)}, y^{(i)})\}_{i=1}^N$ with class labels $y^{(i)} \in \{1, \ldots, K\}$ and a pretrained $M$-layer network $F$, the notation $x_j^{(i)} = \ell_j(x_{j-1}^{(i)})$ denotes the output after layer $j$, where $\ell_j$ is the $j$-th layer and $x_0^{(i)}$ is the input. For each class $k$, define $S_k = \{ i : y^{(i)} = k \}$ and let $|S_k|$ be its size. The class-mean vector at layer $j$ for class $k$ is

$c_j^k = \frac{1}{|S_k|} \sum_{n \in S_k} x_j^{(n)}.$
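The class-mean definition can be illustrated with a minimal sketch for a single layer $j$. It assumes the layer outputs have already been flattened into plain Python lists; the function name `class_means` and the toy data are hypothetical, and a real implementation would normally use an array library.

```python
from collections import defaultdict

def class_means(features, labels):
    """Return {k: c_j^k} for one layer j.

    features: list of equal-length lists of floats (flattened x_j^{(i)})
    labels:   list of class labels y^{(i)}, same length as features
    """
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        # Accumulate the element-wise sum of all features in class y.
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    # Divide each sum by |S_k| to obtain the class mean.
    return {k: [s / counts[k] for s in sums[k]] for k in sums}

# Toy data (illustrative only): two samples of class 1, one of class 2.
feats = [[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]]
labs = [1, 1, 2]
means = class_means(feats, labs)
```

Because this is a single pass over the data with no gradients, it can be repeated independently for every exit-eligible layer.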

At inference, for a sample $x_0$ and each exit-eligible layer $j$, the distance from the layer output $x_j$ to each class mean is computed:

$d_j^{k} = \| x_j - c_j^k \|_2.$

These distances are normalized per class and layer:

$\tilde d_j^k = \frac{d_j^k}{\frac{1}{N}\sum_{p=1}^N d_j^{k,(p)}},$

where $d_j^{k,(p)}$ denotes the distance between the layer-$j$ output of the $p$-th training sample and $c_j^k$. Softmax over negative normalized distances yields approximate class probabilities:

$P(\hat y_j = k) = \mathrm{softmax}_k ( -\tilde d_j^k ).$

A layer-specific threshold $T_j \in [0,1]$ is employed: if $\max_k P(\hat y_j = k) > T_j$, inference halts and the class prediction is given by $k^* = \arg\max_k P(\hat y_j = k)$; otherwise, computation proceeds to the next layer. Thresholds $T_j$ are tuned on the training set to meet FLOPs or accuracy constraints via binary search (Görmez et al., 2021).
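The distance, normalization, softmax, and threshold steps for one layer can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: features are flat Python lists, and the helper names (`mean_training_distances`, `exit_probabilities`, `exit_decision`) and the toy values are hypothetical.

```python
import math

def mean_training_distances(train_feats, means_j):
    """Denominator of the normalization: average distance from all N
    training features at layer j to each class mean c_j^k."""
    n = len(train_feats)
    return {k: sum(math.dist(x, c) for x in train_feats) / n
            for k, c in means_j.items()}

def exit_probabilities(x_j, means_j, mean_dist_j):
    """Softmax over negative normalized distances at layer j."""
    classes = sorted(means_j)
    d_tilde = [math.dist(x_j, means_j[k]) / mean_dist_j[k] for k in classes]
    # Shift by the smallest distance for numerical stability.
    shift = min(d_tilde)
    exps = [math.exp(-(d - shift)) for d in d_tilde]
    z = sum(exps)
    return {k: e / z for k, e in zip(classes, exps)}

def exit_decision(probs, threshold):
    """Return (exit?, k*) where k* is the most probable class."""
    k_star = max(probs, key=probs.get)
    return probs[k_star] > threshold, k_star

# Toy values (illustrative only).
means_j = {1: [0.0, 0.0], 2: [4.0, 0.0]}
train_feats = [[0.0, 0.0], [4.0, 0.0]]
mean_dist_j = mean_training_distances(train_feats, means_j)
probs = exit_probabilities([0.5, 0.0], means_j, mean_dist_j)
```

In this toy case the sample lies close to the class-1 mean, so it exits at a threshold of 0.8 but continues to the next layer at a threshold of 0.9.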

2. Algorithmic Workflow

E$^2$CM operates in two phases: a non-iterative training pass and an inference phase. Neither internal classifiers nor gradient computation is required.

Training phase:

For each layer $j = 1, \ldots, M$:
- For each training sample, compute $x_j^{(i)}$.
- For each class $k = 1, \ldots, K$, compute and store the class mean vector $c_j^k$.

Inference phase, for a test input $x_0$ :

For each layer $j = 1, \ldots, M$:
- Compute $x_j$.
- Compute distances $d_j^k$ for $k = 1, \ldots, K$.
- Normalize distances to obtain $\tilde d_j^k$.
- Compute $P(\hat y_j = k)$ via softmax.
- If $\max_k P(\hat y_j = k) > T_j$, return predicted class $k^*$.
If no threshold is exceeded, return the final-layer network prediction.
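The inference steps above can be sketched as a single loop. The sketch assumes the network is available as a list of per-layer callables plus a separate final prediction function; `e2cm_infer`, `softmax_neg`, and all toy layers and values are hypothetical names chosen for illustration.

```python
import math

def softmax_neg(dists):
    """Softmax of the negated distances, shifted for stability."""
    shift = min(dists)
    exps = [math.exp(-(d - shift)) for d in dists]
    z = sum(exps)
    return [e / z for e in exps]

def e2cm_infer(x0, layers, means, mean_dists, thresholds, final_predict):
    """Run layers in order and exit at the first confident layer.

    layers:        list of callables l_j
    means:         means[j][k] = class mean c_j^k (list of floats)
    mean_dists:    mean_dists[j][k] = average training distance to c_j^k
    thresholds:    thresholds[j] = T_j
    final_predict: maps the last activation to the base network's prediction
    Returns (predicted class, exit layer index or None if no early exit).
    """
    x = x0
    for j, layer in enumerate(layers):
        x = layer(x)
        classes = sorted(means[j])
        d = [math.dist(x, means[j][k]) / mean_dists[j][k] for k in classes]
        p = softmax_neg(d)
        best = max(range(len(classes)), key=p.__getitem__)
        if p[best] > thresholds[j]:
            return classes[best], j
    return final_predict(x), None

# Toy two-layer "network" (illustrative only).
layers = [lambda x: [2 * v for v in x], lambda x: [v + 1 for v in x]]
means = [{0: [0.0, 0.0], 1: [2.0, 2.0]}, {0: [3.0, 3.0], 1: [3.0, 4.0]}]
mean_dists = [{0: 1.0, 1: 1.0}, {0: 1.0, 1: 1.0}]
final_predict = lambda x: 7  # stand-in for the base network's output

early = e2cm_infer([1.0, 1.0], layers, means, mean_dists, [0.9, 0.9], final_predict)
late = e2cm_infer([1.0, 1.0], layers, means, mean_dists, [0.99, 0.7], final_predict)
none = e2cm_infer([1.0, 1.0], layers, means, mean_dists, [0.99, 0.99], final_predict)
```

Raising the thresholds pushes the same input deeper into the network: it exits at the first layer, at the second layer, or not at all, depending on the values of $T_j$.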

Thresholds $T_j$ are calibrated to a target FLOPs envelope, e.g., via binary search (Görmez et al., 2021).
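One way such a calibration could look is sketched below. The source does not spell out the search procedure, so this sketch makes simplifying assumptions that should be read as such: a single threshold shared by all layers, precomputed per-sample confidences, and hypothetical cost figures. The search works because raising the threshold can only delay a sample's exit, so the average cost is non-decreasing in the threshold.

```python
def average_cost(confidences, layer_costs, full_cost, threshold):
    """Mean FLOPs per sample when exiting at the first layer whose
    confidence exceeds the threshold.

    confidences[i][j] = max_k P(y_j = k) for sample i at exit layer j
    layer_costs[j]    = cumulative FLOPs to reach and evaluate exit j
    full_cost         = FLOPs of running the whole network
    """
    total = 0.0
    for conf in confidences:
        cost = full_cost
        for j, c in enumerate(conf):
            if c > threshold:
                cost = layer_costs[j]
                break
        total += cost
    return total / len(confidences)

def calibrate_threshold(confidences, layer_costs, full_cost, budget, iters=30):
    """Binary-search the largest shared threshold within the FLOPs budget.

    Assumes the budget is reachable at threshold 0 (every sample exits
    at the first layer); otherwise the result stays near 0.
    """
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if average_cost(confidences, layer_costs, full_cost, mid) <= budget:
            lo = mid   # within budget: try a stricter threshold
        else:
            hi = mid   # over budget: relax the threshold
    return lo

# Toy confidences for three samples at two exits (illustrative only).
confs = [[0.6, 0.9], [0.95, 0.99], [0.3, 0.5]]
costs = [1.0, 2.0]
t = calibrate_threshold(confs, costs, full_cost=4.0, budget=2.0)
```

Per-layer thresholds $T_j$, as described in the text, would need a search over several values; the shared-threshold version is used here only to keep the example short.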

3. Computational and Memory Overhead

At each potential exit layer $j$, E$^2$CM stores $K$ class-mean vectors, each matching the dimensionality of $x_j$. The cumulative extra memory is

$\sum_{j=1}^{M} (K \cdot \text{dim}(x_j)).$

For example, for ResNet-152 on CIFAR-10 ($K=10$), the total overhead is about $20.9\,\text{MB}$ against a base model size of $222\,\text{MB}$, i.e., less than $10\%$. For $K=100$ (CIFAR-100), the overhead grows linearly with $K$, approaching $94\%$. Dimensionality reduction such as $2 \times 2$ max-pooling can mitigate this with negligible impact on separability.
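The memory formula and the quoted ratios can be checked with a short calculation. The sketch assumes 4-byte (float32) values and uses hypothetical layer shapes that are not ResNet-152's actual ones; only the $20.9\,\text{MB}$ and $222\,\text{MB}$ figures come from the text.

```python
def class_mean_memory_mb(num_classes, feature_dims, bytes_per_value=4):
    """Memory for K class means at every exit layer: sum_j K * dim(x_j).
    bytes_per_value=4 assumes float32 storage."""
    return num_classes * sum(feature_dims) * bytes_per_value / 1e6

def pooled_dim(channels, height, width, pool=2):
    """Flattened size after non-overlapping pool x pool max-pooling."""
    return channels * (height // pool) * (width // pool)

# Hypothetical layer shapes (channels, height, width), for illustration only.
shapes = [(64, 32, 32), (128, 16, 16), (256, 8, 8)]
raw_dims = [c * h * w for c, h, w in shapes]
pooled_dims = [pooled_dim(c, h, w) for c, h, w in shapes]

raw_mb = class_mean_memory_mb(10, raw_dims)
pooled_mb = class_mean_memory_mb(10, pooled_dims)  # 4x smaller with 2x2 pooling

# Ratios implied by the figures quoted in the text.
ratio_k10 = 20.9 / 222         # about 0.094, i.e. under 10%
ratio_k100 = 10 * 20.9 / 222   # about 0.94, since memory is linear in K
```

With these shapes, $2 \times 2$ pooling cuts the stored class means by a factor of four, which is why pooling is an effective remedy when $K$ is large.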

In terms of computational demands, each early-exit layer adds $K$ vector subtractions, $K$ $\ell_2$ norms, normalization, and one softmax per inference. Overheads are small: for example, in ResNet-152 or WideResNet-101 on CIFAR-10, this is about $0.007$ normalized extra FLOPs per layer; for CIFAR-100, $0.057$; and with max-pooling on ImageNet, about $0.001$.
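A rough operation count for one exit can be sketched as below. The counting convention (one FLOP per subtraction, multiply, add, division, or exponential, with the square root ignored) is an assumption made for illustration, as are the function names and the example numbers; the sketch does not reproduce the reported figures, which depend on the actual feature dimensions.

```python
def exit_flops(num_classes, dim):
    """Approximate FLOPs added by one E2CM exit with K classes and
    feature dimension dim, under a simple per-operation convention."""
    subtract = num_classes * dim      # x_j - c_j^k for each class
    norms = num_classes * 2 * dim     # square and accumulate (sqrt ignored)
    normalize = num_classes           # one division per class
    softmax = 3 * num_classes         # exp, running sum, divide (rough)
    return subtract + norms + normalize + softmax

def normalized_overhead(num_classes, dim, model_flops):
    """Exit cost as a fraction of the full model's FLOPs."""
    return exit_flops(num_classes, dim) / model_flops

# Hypothetical numbers: 10 classes, 100-dimensional features.
per_exit = exit_flops(10, 100)
fraction = normalized_overhead(10, 100, model_flops=3_040_000)
```

The count is linear in both $K$ and the feature dimension, which is consistent with the overhead being larger for CIFAR-100 than for CIFAR-10 and smaller when pooling reduces the dimension.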

Relative to gradient-based internal classifiers (ICs), which require learning additional parameters and entail larger test-time FLOPs and memory, E$^2$CM's overhead is minimal (Görmez et al., 2021).

4. Experimental Findings and Trade-Offs

Experiments evaluated E$^2$CM against early-exit schemes including Shallow-Deep Networks (SDN), BranchyNet, and BWDS, on architectures such as MobileNetV3, EfficientNet, and ResNet, and on datasets such as CIFAR-10, CIFAR-100, ImageNet, and KMNIST.

Key regimes:

- One-epoch training (supervised): E$^2$CM achieves up to $50\%$ higher accuracy at equivalent FLOPs, or $50\%$ fewer FLOPs at comparable accuracy, across all reported models and datasets.
- Limited multi-epoch training (e.g., $6$ epochs, with BWDS): Combining E$^2$CM with SDN, e.g., on ResNet-152/CIFAR-100 at $25\%$ of FLOPs, yielded $52\%$ early-exit accuracy versus $27\%$ (SDN), $42\%$ (BWDS), or $26\%$ (BranchyNet).
- Fine-tuning (ImageNet, TinyImageNet): EfficientNet-B0 and MobileNetV3Large showed higher accuracy at lower FLOPs compared to SDN and BranchyNet, with class-mean memory as low as $1.87\,\text{MB}$ (ImageNet) and $0.87\,\text{MB}$ (TinyImageNet) after pooling.
- Unlimited training (E$^2$CM+ICs): At low FLOPs ($15\%$ of the full model), E$^2$CM+SDN improved accuracy by up to $6\%$ over SDN alone; at moderate FLOPs ($30\%$), E$^2$CM+SDN yielded $88.8\%$ versus $88.4\%$ for SDN (Görmez et al., 2021).

5. Unsupervised Learning Extension

E$^2$CM extends directly to unsupervised clustering. Incorporation within Deep Embedding Clustering (DEC) is achieved by (a) training multiple DEC networks with varying encoder depths, (b) extracting, copying, and freezing early encoder layers, each with an attached clustering layer (CL), and (c) treating DEC cluster centres as class means $c_j^k$. The same distance–softmax–thresholding exit rule is applied at each CL. On MNIST and Fashion-MNIST, E$^2$CM enabled $60\%$ FLOPs savings at a clustering accuracy loss of $6\%$ and $1\%$, respectively.

6. Distinguishing Characteristics and Applicability

E$^2$CM is distinguished by its plug-and-play nature: no gradient-based training or internal classifier design is necessary, nor is any modification of the base network required. The method’s storage and compute overheads remain modest, particularly with pooling, and it provides practical FLOPs/accuracy trade-offs, especially under tight training constraints or in edge-device scenarios. E$^2$CM achieves notable gains over gradient-based early-exit methods in both supervised and unsupervised paradigms, and can further enhance existing early-exit schemes by direct combination (Görmez et al., 2021).

A plausible implication is that E$^2$CM is especially suitable for applications with limited training compute or storage, or where model modifications are infeasible, such as wireless edge networks.

References (1)

E$^2$CM: Early Exit via Class Means for Efficient Supervised and Unsupervised Learning (2021)
