Early Exit Class Means (E²CM)

Updated 31 January 2026
  • Early Exit Class Means (E²CM) is a method that uses class mean vectors derived from intermediate features to decide when to exit early during inference.
  • It computes Euclidean distances between features and class means, applies softmax over negative distances, and uses tunable thresholds to balance accuracy and computational cost.
  • Experimental results demonstrate significant FLOPs savings and accuracy trade-offs compared to traditional gradient-based early-exit schemes across various architectures and datasets.

Early Exit Class Means (E$^2$CM) is a neural network early-exit technique that uses class mean vectors derived from feature representations to enable efficient inference and training in both supervised and unsupervised learning contexts. E$^2$CM operates without gradient-based training of internal classifiers and does not modify the architecture or parameters of the base network, which makes it attractive for deployment on low-power or resource-constrained devices. The method computes Euclidean distances between intermediate representations and class means, and uses a softmax-based confidence criterion with tunable thresholds to decide whether to exit early or continue computation to deeper layers. E$^2$CM has demonstrated favorable trade-offs in accuracy and computational cost compared to traditional early-exit schemes (Görmez et al., 2021).

1. Formal Framework: Class Means and Early Exit Rule

Given a training dataset $D = \{(x_0^{(i)}, y^{(i)})\}_{i=1}^N$ with class labels $y^{(i)} \in \{1, \ldots, K\}$ and a pretrained $M$-layer network $F$, the notation $x_j^{(i)} = \ell_j(x_{j-1}^{(i)})$ denotes the output after layer $j$, where $\ell_j$ is the $j$-th layer and $x_0^{(i)}$ the input. For each class $k$, define $S_k = \{ i : y^{(i)} = k \}$ and $|S_k|$ as its size. The class-mean vector at layer $j$ for class $k$ is

$$c_j^k = \frac{1}{|S_k|} \sum_{n \in S_k} x_j^{(n)}.$$
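As a minimal sketch of the class-mean definition above, the following NumPy snippet averages the layer-$j$ features of each class. The function name `class_means`, the 0-based labels (the text uses $1, \ldots, K$), and the toy data are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def class_means(features, labels, num_classes):
    """Return a (num_classes, dim) array; row k is the mean feature of class k.

    features: array of shape (N, ...) holding x_j^{(n)} for one layer j.
    labels:   array of shape (N,) with 0-based class indices (an assumption;
              the text indexes classes from 1 to K).
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    flat = features.reshape(len(features), -1)  # flatten each feature map to a vector
    means = np.zeros((num_classes, flat.shape[1]))
    for k in range(num_classes):
        members = flat[labels == k]  # rows whose index lies in S_k
        if len(members) == 0:
            raise ValueError(f"class {k} has no training samples")
        means[k] = members.mean(axis=0)
    return means

# Toy data: two samples of class 0, one sample of class 1.
feats = np.array([[0.0, 0.0], [2.0, 2.0], [10.0, 0.0]])
labs = np.array([0, 0, 1])
means = class_means(feats, labs, num_classes=2)
print(means)  # class 0 mean is [1, 1], class 1 mean is [10, 0]
```

In practice this would be run once per exit-eligible layer, with `features` produced by a single forward pass over the training set.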

At inference, for a sample $x_0$ and each exit-eligible layer $j$, the distance to each class mean is computed:

$$d_j^{k} = \| x_j - c_j^k \|_2.$$

These distances are normalized per class and layer:

$$\tilde d_j^k = \frac{d_j^k}{\frac{1}{N}\sum_{p=1}^N d_j^{k,(p)}},$$

where $d_j^{k,(p)}$ denotes the distance from the $p$-th training sample to $c_j^k$ at layer $j$, recorded during training. A softmax over the negative normalized distances yields approximate class probabilities:

$$P(\hat y_j = k) = \mathrm{softmax}_k ( -\tilde d_j^k ).$$

A layer-specific threshold $T_j \in [0,1]$ is employed: if $\max_k P(\hat y_j = k) > T_j$, inference halts and the class prediction is given by $k^* = \arg\max_k P(\hat y_j = k)$; otherwise, computation proceeds to the next layer. Thresholds $T_j$ are tuned on the training set to meet FLOPs or accuracy constraints via binary search (Görmez et al., 2021).
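The exit rule at a single layer can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the function name `exit_decision`, the argument `mean_train_dist_j` (the per-class average training distance used as the normalizer), and the toy numbers are hypothetical, and class indices are 0-based.

```python
import numpy as np

def exit_decision(x_j, means_j, mean_train_dist_j, threshold_j):
    """Apply the distance, normalization, softmax, and threshold steps at one layer.

    x_j:               (dim,) feature vector at layer j
    means_j:           (K, dim) class means c_j^k
    mean_train_dist_j: (K,) average training distance to each class mean
    threshold_j:       scalar T_j in [0, 1]
    Returns (should_exit, predicted_class, probabilities).
    """
    d = np.linalg.norm(means_j - x_j, axis=1)  # Euclidean distances d_j^k
    d_tilde = d / mean_train_dist_j            # normalized distances
    logits = -d_tilde
    logits = logits - logits.max()             # shift for numerical stability
    p = np.exp(logits) / np.exp(logits).sum()  # softmax over negative distances
    k_star = int(np.argmax(p))
    return bool(p[k_star] > threshold_j), k_star, p

means_j = np.array([[0.0, 0.0], [4.0, 0.0]])
norm_j = np.ones(2)                 # toy normalizers
x_j = np.array([0.5, 0.0])          # close to class 0

confident = exit_decision(x_j, means_j, norm_j, threshold_j=0.90)
strict = exit_decision(x_j, means_j, norm_j, threshold_j=0.99)
print(confident[0], confident[1], round(float(confident[2][0]), 4))  # True 0 0.9526
print(strict[0])                                                     # False
```

With distances 0.5 and 3.5, the softmax assigns about 0.95 to class 0, so the sample exits at a threshold of 0.90 but continues at 0.99.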

2. Algorithmic Workflow

E$^2$CM operates in two phases: a non-iterative training pass and an inference phase. No internal classifiers or gradient computations are required.

Training phase:

  • For each layer $j = 1, \ldots, M$:
    • For each training sample, compute $x_j^{(i)}$.
    • For each class $k = 1, \ldots, K$, compute and store the class mean vector $c_j^k$.

Inference phase, for a test input $x_0$:

  • For each layer $j = 1, \ldots, M$:
    • Compute $x_j$.
    • Compute distances $d_j^k$ for $k = 1, \ldots, K$.
    • Normalize distances to obtain $\tilde d_j^k$.
    • Compute $P(\hat y_j = k)$ via softmax.
    • If $\max_k P(\hat y_j = k) > T_j$, return the predicted class $k^*$.
  • If no threshold is exceeded, return the final-layer network prediction.
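The inference loop above might look like the following sketch. Everything here is a toy stand-in: the layers are plain Python callables, `final_classifier` is a hypothetical placeholder for the network's last-layer prediction, the normalizers are set to one, and classes are 0-indexed.

```python
import numpy as np

def softmax_neg(d_tilde):
    """Softmax over negative normalized distances."""
    z = -np.asarray(d_tilde, dtype=np.float64)
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def e2cm_infer(x0, layers, means, mean_train_dists, thresholds, final_classifier):
    """Return (predicted_class, exit_layer); exit_layer is None if no exit fired."""
    x = np.asarray(x0, dtype=np.float64)
    for j, layer in enumerate(layers, start=1):
        x = layer(x)                                          # compute x_j
        d = np.linalg.norm(means[j - 1] - x.ravel(), axis=1)  # distances d_j^k
        p = softmax_neg(d / mean_train_dists[j - 1])          # P(y_hat_j = k)
        if p.max() > thresholds[j - 1]:
            return int(np.argmax(p)), j                       # early exit at layer j
    return final_classifier(x), None                          # fall back to the full network

# Toy two-layer "network" with two classes (all values are illustrative).
layers = [lambda x: 2.0 * x, lambda x: x + 1.0]
means = [np.array([[0.0, 0.0], [4.0, 0.0]]), np.array([[1.0, 1.0], [5.0, 1.0]])]
dists = [np.ones(2), np.ones(2)]
thresholds = [0.9, 0.9]
head = lambda x: int(x[0] >= 3.0)  # hypothetical final-layer classifier

easy = e2cm_infer([0.25, 0.0], layers, means, dists, thresholds, head)
hard = e2cm_infer([1.0, 0.0], layers, means, dists, thresholds, head)
print(easy)  # (0, 1): confident at layer 1, exits early
print(hard)  # (1, None): equidistant at both layers, uses the final prediction
```

The second input sits exactly between the two class means at every layer, so its confidence stays at 0.5 and the sample runs through the whole network.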

Thresholds $T_j$ are calibrated to a target FLOPs envelope, e.g., via binary search (Görmez et al., 2021).
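One way such a binary search could be set up for a single layer is sketched below. This is not the calibration procedure of the cited paper: it assumes, for illustration only, that the FLOPs budget has already been translated into a minimum fraction of training samples that must exit at the layer, and it searches for the strictest threshold that still meets that fraction. The names `calibrate_threshold` and `target_exit_fraction` are hypothetical.

```python
import numpy as np

def calibrate_threshold(max_conf, target_exit_fraction, iters=30):
    """Binary-search the largest T in [0, 1] with mean(max_conf > T) >= target.

    max_conf: per-sample max_k P(y_hat_j = k) at one layer, measured on training data.
    The exit fraction is non-increasing in T, which is what makes bisection valid.
    """
    max_conf = np.asarray(max_conf, dtype=np.float64)
    lo, hi = 0.0, 1.0  # lo always satisfies the target (softmax maxima are > 0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.mean(max_conf > mid) >= target_exit_fraction:
            lo = mid   # enough samples still exit, so try a stricter threshold
        else:
            hi = mid   # too few exits, so relax the threshold
    return lo

# Toy confidences for ten training samples at one layer.
conf = np.array([0.50, 0.55, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90, 0.95, 0.99])
T = calibrate_threshold(conf, target_exit_fraction=0.5)
exit_fraction = float(np.mean(conf > T))
print(round(T, 4), exit_fraction)  # T lands just below 0.8, and half the samples exit
```

A higher threshold generally trades fewer early exits (more FLOPs) for more reliable early predictions, so picking the largest feasible threshold is one reasonable design choice among several.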

3. Computational and Memory Overhead

At each potential exit layer $j$, E$^2$CM stores $K$ class-mean vectors, each matching the dimensionality of $x_j$. The cumulative extra memory is

$$\sum_{j=1}^{M} \bigl(K \cdot \dim(x_j)\bigr).$$

For example, for ResNet-152 on CIFAR-10 ($K=10$), the class means occupy about $20.9\,\text{MB}$ against a base model size of $222\,\text{MB}$, i.e., less than $10\%$ overhead. Because the storage grows linearly in $K$, the overhead approaches $94\%$ for $K=100$ (CIFAR-100). Dimensionality reduction such as $2 \times 2$ max-pooling can mitigate this with negligible impact on class separability.
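The arithmetic behind these figures can be checked with a short sketch. The helper `class_mean_memory_bytes`, the 4-byte (float32) storage assumption, and the toy layer shapes are illustrative assumptions; only the 20.9 MB and 222 MB figures come from the text.

```python
def class_mean_memory_bytes(layer_dims, num_classes, bytes_per_value=4):
    """Sum over layers of K * dim(x_j) stored values, converted to bytes.

    bytes_per_value=4 assumes float32 storage (an assumption, not from the source).
    """
    return sum(num_classes * dim * bytes_per_value for dim in layer_dims)

# Hypothetical feature-map sizes (channels * height * width) for two exit layers.
toy_dims = [64 * 32 * 32, 128 * 16 * 16]
toy_bytes = class_mean_memory_bytes(toy_dims, num_classes=10)

# 2x2 pooling with stride 2 is assumed to quarter each spatial map.
pooled_bytes = class_mean_memory_bytes([d // 4 for d in toy_dims], num_classes=10)

# Scaling check using the figures quoted in the text.
base_mb = 222.0
cifar10_mb = 20.9
cifar100_mb = cifar10_mb * (100 / 10)  # linear in K
ratio10 = cifar10_mb / base_mb
ratio100 = cifar100_mb / base_mb

print(toy_bytes, pooled_bytes)               # 3932160 983040
print(round(ratio10, 3), round(ratio100, 3)) # 0.094 0.941
```

Scaling 20.9 MB by a factor of ten gives 209 MB, which is about 94% of the 222 MB base model, consistent with the figure quoted for CIFAR-100.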

In terms of computational demands, each early-exit layer adds $K$ vector subtractions, $K$ $\ell_2$ norms, normalization, and one softmax per inference. Overheads are small: for example, in ResNet-152 or WideResNet-101 on CIFAR-10, this is about $0.007$ normalized extra FLOPs per layer; for CIFAR-100, $0.057$; and with max-pooling on ImageNet, about $0.001$.

Relative to gradient-based internal classifiers (ICs), which require learning additional parameters and entail larger test-time FLOPs and memory, E$^2$CM's overhead is minimal (Görmez et al., 2021).

4. Experimental Findings and Trade-Offs

Experiments evaluated E$^2$CM against early-exit schemes including Shallow-Deep Networks (SDN), BranchyNet, and BWDS, across architectures such as MobileNetV3, EfficientNet, and ResNet, and datasets such as CIFAR-10, CIFAR-100, ImageNet, and KMNIST.

Key regimes:

  • One-epoch training (supervised): E$^2$CM achieves up to $50\%$ higher accuracy at equivalent FLOPs or $50\%$ fewer FLOPs at comparable accuracy across all reported models and datasets.
  • Limited multi-epoch training (e.g., $6$ epochs, with BWDS): Combining E$^2$CM with SDN, e.g., on ResNet-152/CIFAR-100 at $25\%$ of FLOPs, yielded $52\%$ early-exit accuracy versus $27\%$ (SDN), $42\%$ (BWDS), or $26\%$ (BranchyNet).
  • Fine-tuning (ImageNet, TinyImageNet): EfficientNet-B0 and MobileNetV3Large showed higher accuracy at lower FLOPs compared to SDN and BranchyNet, with class-mean memory as low as $1.87\,\text{MB}$ (ImageNet) and $0.87\,\text{MB}$ (TinyImageNet) after pooling.
  • Unlimited training (E$^2$CM+ICs): At low FLOPs ($15\%$ of the full model), E$^2$CM+SDN improved accuracy by up to $6\%$ over SDN alone; at moderate FLOPs ($30\%$), E$^2$CM+SDN reached $88.8\%$ accuracy versus $88.4\%$ for SDN (Görmez et al., 2021).

5. Unsupervised Learning Extension

E$^2$CM extends directly to unsupervised clustering. Incorporation within Deep Embedding Clustering (DEC) is achieved by (a) training multiple DEC networks with varying encoder depths, (b) extracting, copying, and freezing early encoder layers, each with an attached clustering layer (CL), and (c) treating the DEC cluster centres as class means $c_j^k$. The same distance–softmax–thresholding exit rule is applied at each CL. On MNIST and Fashion-MNIST, E$^2$CM enabled a $60\%$ FLOPs saving at a clustering accuracy loss of $6\%$ and $1\%$, respectively.

6. Distinguishing Characteristics and Applicability

E$^2$CM is distinguished by its plug-and-play nature: no gradient-based training or internal classifier design is necessary, nor is any modification of the base network required. The method's storage and compute overheads remain modest, particularly with pooling, and it provides practical FLOPs/accuracy trade-offs, especially under tight training constraints or in edge-device scenarios. E$^2$CM achieves notable gains over gradient-based early-exit methods in both supervised and unsupervised paradigms, and can further enhance existing early-exit schemes by direct combination (Görmez et al., 2021).

A plausible implication is that E$^2$CM is especially suitable for applications with limited training compute or storage, or where model modifications are infeasible, such as wireless edge networks.

References (1)
