Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns

Published 8 May 2026 in cs.LG | (2605.07378v1)

Abstract: Zero-shot proxies, also known as training-free metrics, are widely adopted to reduce the computational overhead in neural network evaluation for scenarios such as Neural Architecture Search (NAS), as they do not require any training. Existing zero-shot metrics have several limitations, including weak correlation with the true performance and poor generalisation across different networks or downstream tasks. For example, most of these metrics apply only to either convolutional neural networks (CNNs) or Transformers, but not both. To address these limitations, we propose Sample-Wise Activation Patterns (SWAP), and its derivative, SWAP-Score, a novel and highly effective zero-shot metric. SWAP-Score is broadly applicable across both architecture families and task domains, demonstrating strong predictive performance in the majority of tasks. This metric measures the expressivity of neural networks over a mini-batch of samples, showing a high correlation with the neural networks' ground-truth performance. For both CNNs and Transformers, the SWAP-Score outperforms existing zero-shot metrics across computer vision and natural language processing tasks. For instance, Spearman's correlation coefficient between the SWAP-Score and CIFAR-10 validation accuracy for DARTS CNNs is 0.93, and 0.71 for FlexiBERT Transformers on GLUE tasks. Moreover, SWAP-Score is label-independent, hence can be applied at the pre-training stage of LLMs to estimate their performance for downstream tasks. When applied to NAS, SWAP-empowered NAS, SWAP-NAS can achieve competitive performance using only approximately 6 and 9 minutes of GPU time, on CIFAR-10 and ImageNet respectively. Our code is available at: https://github.com/pym1024/SWAP_Universal


Authors (5)

Summary

The paper introduces SWAP, a novel method using sample-wise activation patterns to form the SWAP-Score for zero-shot network evaluation.
It integrates a regularization mechanism to control network size bias and achieves high correlations (up to 0.93) with ground-truth accuracy.
SWAP-Score accelerates neural architecture search by providing rapid, cross-domain evaluation for both CNNs and Transformers.

Zero-Shot Neural Network Evaluation with Sample-Wise Activation Patterns (SWAP)

Introduction

This paper introduces Sample-Wise Activation Patterns (SWAP) and its derivative, SWAP-Score, a zero-shot, training-free metric for neural network performance evaluation and neural architecture search (NAS). SWAP-Score is designed to overcome three limitations of prior zero-shot metrics: weak correlation with ground-truth accuracy, poor cross-domain generalization, and applicability to either CNNs or Transformers but rarely both. Through an analysis rooted in network expressivity and extensive empirical benchmarking, the paper shows that SWAP-Score predicts performance accurately and generalizes across tasks and architectures, with particular gains in efficiency and accuracy for NAS applications.

Limitations of Existing Zero-Shot Metrics

Traditional neural architecture evaluation requires extensive backpropagation-driven training, limiting scalability for tasks such as NAS, knowledge distillation, and reinforcement learning (RL). To mitigate this challenge, training-free ("zero-shot") proxies have been proposed. These proxies typically estimate a small set of properties, such as FLOPs, parameter counts, or theoretical measures (e.g., neural tangent kernel spectra), to provide fast yet coarse predictions.

However, empirical studies show most proxies (e.g., TE-NAS, Zen-Score, MeCo, NWOT, parameter count, FLOPs) either fail to generalize across search spaces (cell-based, macro, Transformer), offer limited or inconsistent correlation with validation accuracy, or even underperform trivial size-based metrics in several domains [25]. Transformer-specific proxies (e.g., Attention Confidence, DSS-Indicator) are likewise weak outside dedicated architecture families [49].

SWAP: Theoretical Foundations and Metric Definition

From Activation Patterns to SWAP

Prior notions of network expressivity count the distinct activation patterns obtained through standard activation binarization. These counts are fundamentally limited by input batch size and network depth/dimensionality [5, 26, 50]. For complex or high-dimensional tasks, the count quickly saturates at its upper bound and loses discriminatory power.

SWAP generalizes this approach by considering sample-wise rather than value-wise activation patterns. Given an architecture $N$ with randomly initialized parameters $\theta$ and a batch of $S$ input samples, each intermediate activation value is binarized (or ternarized for GELU) on every input, and the signs of one value across all $S$ inputs form a single pattern. The SWAP set is defined as

$\widehat{\mathcal{A}}_{N,\theta} = \{ \mathbf{p}_v : \mathbf{p}_v = (\text{sign}(p_v^s))_{s=1}^{S},\ \forall v \in \{1,\dots,V\} \}$

where $V$ is the number of intermediate activation values and $p_v^s$ is the $v$-th such value for sample $s$. The cardinality of this set, i.e., the number of unique sample-wise patterns, is the SWAP-Score of the architecture: $Y_{N,\theta} = |\widehat{\mathcal{A}}_{N,\theta}|$.
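The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `swap_score` and the `(S, V)` array layout are assumptions made here, and the activation matrix is hand-written so that the result can be checked by eye.

```python
import numpy as np

def swap_score(activations: np.ndarray) -> int:
    """Count unique sample-wise sign patterns.

    activations: array of shape (S, V), one row per input sample and one
    column per intermediate activation value (hypothetical layout).
    """
    signs = np.sign(activations).astype(np.int8)   # entries in {-1, 0, 1}
    # Column v is the pattern p_v: the signs of value v across all S samples.
    patterns = signs.T                             # shape (V, S)
    return int(np.unique(patterns, axis=0).shape[0])

# Hand-made example: S = 3 samples, V = 4 intermediate values.
acts = np.array([
    [0.5, 0.0, 0.5, 2.0],
    [0.0, 1.2, 0.0, 0.3],
    [0.7, 0.0, 0.1, 0.0],
])
print(swap_score(acts))  # columns 0 and 2 share the pattern (1, 0, 1) -> 3
```

Using `np.sign` keeps three levels, so the same sketch covers the ternary case mentioned for GELU, where activation values can be negative.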

Properties and Regularization

The sample-wise encoding yields a much higher capacity for discriminating networks than the original pattern-counting, providing fine-grained sensitivity to network expressivity. However, as with other zero-shot proxies, SWAP-Score correlates with network size, potentially biasing NAS toward oversized models if used standalone.
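The capacity difference can be made concrete by counting both kinds of pattern on the same activations. The sketch below assumes a two-layer random ReLU network with arbitrary sizes (`S`, `D`, `H` are illustrative choices, not values from the paper) and uses random inputs only to show the bounds: one pattern per input can yield at most $S$ distinct patterns, while one pattern per intermediate value can yield up to $\min(V, 2^S)$ in the binary case.

```python
import numpy as np

rng = np.random.default_rng(0)
S, D, H = 8, 16, 64          # samples, input dimension, hidden width (arbitrary)

x = rng.standard_normal((S, D))
w1 = rng.standard_normal((D, H))
w2 = rng.standard_normal((H, H))

# Two ReLU layers with random weights; no training involved.
h1 = np.maximum(x @ w1, 0.0)
h2 = np.maximum(h1 @ w2, 0.0)
acts = np.concatenate([h1, h2], axis=1)        # shape (S, V) with V = 2 * H
signs = (acts > 0).astype(np.int8)

V = acts.shape[1]
per_sample = np.unique(signs, axis=0).shape[0]      # rows: one pattern per input
sample_wise = np.unique(signs.T, axis=0).shape[0]   # columns: one pattern per value

print(f"per-sample patterns:  {per_sample} (bound S = {S})")
print(f"sample-wise patterns: {sample_wise} (bound min(V, 2**S) = {min(V, 2**S)})")
```

Because the per-sample count can never exceed the batch size, every sufficiently expressive network hits the same ceiling, whereas the sample-wise count still has room to separate networks of different width and depth.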

To control for size bias, the authors introduce a regularization term,

$f(\Omega) = \exp\left( -\frac{(\Omega - \mu)^2}{2\sigma^2} \right),$

where $\Omega$ is the parameter count, and $\mu$ and $\sigma$ are parameters set by the user or adaptively. The regularized SWAP-Score is then $Y'_{N,\theta} = Y_{N,\theta} \cdot f(\Omega)$, where $Y_{N,\theta}$ is the unregularized SWAP-Score. This allows tailored control of model size in NAS applications.
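As a small worked example of the regularizer, the sketch below applies $f(\Omega)$ to three hypothetical candidates. The scores, parameter counts (in millions), and the choice $\mu = 3$, $\sigma = 1$ are invented for illustration and do not come from the paper.

```python
import math

def size_regularizer(omega: float, mu: float, sigma: float) -> float:
    """Gaussian weight f(omega) that peaks at omega == mu."""
    return math.exp(-((omega - mu) ** 2) / (2.0 * sigma ** 2))

def regularized_swap_score(score: float, omega: float, mu: float, sigma: float) -> float:
    """Y' = Y * f(omega)."""
    return score * size_regularizer(omega, mu, sigma)

# Hypothetical candidates: name -> (raw SWAP-Score, parameter count in millions).
mu, sigma = 3.0, 1.0
candidates = {"small": (900, 1.0), "medium": (1000, 3.0), "large": (1100, 6.0)}

for name, (score, omega) in candidates.items():
    print(name, round(regularized_swap_score(score, omega, mu, sigma), 2))
```

The medium candidate keeps its full score because $\Omega = \mu$ gives $f(\Omega) = 1$, while the large candidate, which has the highest raw score, is weighted down sharply. Shifting $\mu$ moves the preferred model size and widening $\sigma$ softens the penalty.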

Empirical Evaluation

Correlation with Ground-Truth Performance

Computer Vision

SWAP-Score and its regularized variant are benchmarked against 15+ popular zero-shot metrics using 1000 random architectures drawn from diverse search spaces (NAS-Bench-101/201/301, TransNAS-Bench-101-Micro/Macro) and across multiple vision tasks (CIFAR-10/100, ImageNet-1k, ImageNet16-120, object/scene classification, jigsaw, autoencoding). SWAP-Score consistently attains top or near-top Spearman's correlation coefficients with ground-truth accuracy in nearly all settings, often exceeding 0.8 or even 0.9 (e.g., 0.93 on DARTS/CIFAR-10, 0.81+ on challenging macro spaces). Regularization both increases correlation in cell-based search spaces and allows explicit control over the trade-off between model size and accuracy.
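For readers unfamiliar with the evaluation measure, the sketch below shows how Spearman's rank correlation between a proxy score and ground-truth accuracy can be computed. The helper `spearman_rho` is written here for illustration and assumes no tied values; the five score and accuracy pairs are made up and are not benchmark results.

```python
import numpy as np

def spearman_rho(a, b) -> float:
    """Spearman's rank correlation for sequences without ties."""
    a, b = np.asarray(a), np.asarray(b)
    rank_a = np.argsort(np.argsort(a))   # 0-based rank of each element
    rank_b = np.argsort(np.argsort(b))
    # Pearson correlation of the ranks equals Spearman's rho when there are no ties.
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Made-up proxy scores and validation accuracies for five architectures.
scores = [120, 340, 210, 560, 480]
accuracy = [0.81, 0.88, 0.90, 0.94, 0.92]

print(round(spearman_rho(scores, accuracy), 2))  # -> 0.9
```

Only the ordering matters: the two lists disagree on a single adjacent pair (the second and third architectures), which is why the coefficient is high but below 1.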

Natural Language Processing

SWAP-Score is evaluated on 500 pre-trained BERT-like Transformers sampled from the FlexiBERT space and fine-tuned on GLUE tasks. Computed without labels, using only the pre-training dataset, SWAP-Score achieves state-of-the-art correlation with final task accuracy: 0.71 overall (surpassing prior attention-based proxies and size-based baselines), with strong per-task correlations (up to 0.75 for CoLA, 0.72 for MNLI, 0.73 for SST). Only tasks known to display weak cross-transfer, such as RTE, produce lower alignment, which suggests the metric reflects genuine transferability patterns.

Neural Architecture Search (SWAP-NAS)

SWAP-Score is integrated as a performance measure in a regularized evolutionary NAS pipeline (SWAP-NAS) in the DARTS search space. The main findings are:

On CIFAR-10, SWAP-NAS requires only 0.004 GPU days (6 minutes), which is 6.5x faster than the previous SOTA TE-NAS, and outperforms or matches most one-shot and predictor-based NAS methods.
On ImageNet, direct search (without proxy transfer) achieves competitive accuracy (23.3% error rate) in 0.006 GPU days (9 minutes), 2.3x faster than the previous SOTA (QE-NAS), underscoring SWAP-Score's efficiency.
The regularization parameters ($\mu$, $\sigma$) can be set adaptively online during NAS, supporting search-space-agnostic, end-to-end architecture discovery.
The method allows for balancing performance and model size per end-user requirements.
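The search loop can be sketched as follows, assuming an aging-evolution scheme in the style of regularized evolution, which this summary does not spell out. Everything concrete here is a stand-in: the search space is a toy three-layer MLP defined by layer widths rather than DARTS cells, and `MU`, `SIGMA`, the population size, and the cycle count are hypothetical values.

```python
import collections
import random
import numpy as np

rng = np.random.default_rng(0)
random.seed(0)
X = rng.standard_normal((16, 8))   # toy mini-batch: 16 samples, 8 features
WIDTHS = [4, 8, 16, 32]            # toy search space: width of each of 3 layers
MU, SIGMA = 600.0, 300.0           # hypothetical size target and spread (parameter count)

def evaluate(arch):
    """Zero-shot fitness: unique sample-wise patterns times a size regularizer."""
    h, acts, n_params = X, [], 0
    for width in arch:
        w = rng.standard_normal((h.shape[1], width))   # fresh random init, no training
        n_params += w.size
        h = np.maximum(h @ w, 0.0)
        acts.append(h)
    signs = (np.concatenate(acts, axis=1) > 0).astype(np.int8)
    score = np.unique(signs.T, axis=0).shape[0]
    return float(score * np.exp(-((n_params - MU) ** 2) / (2 * SIGMA ** 2)))

def mutate(arch):
    """Change the width of one randomly chosen layer."""
    child = list(arch)
    child[random.randrange(len(child))] = random.choice(WIDTHS)
    return tuple(child)

def regularized_evolution(population_size=10, sample_size=3, cycles=30):
    population = collections.deque()
    best = None
    for _ in range(population_size):
        arch = tuple(random.choice(WIDTHS) for _ in range(3))
        individual = (evaluate(arch), arch)
        population.append(individual)
        if best is None or individual[0] > best[0]:
            best = individual
    for _ in range(cycles):
        # Tournament selection: the fittest of a random sample becomes the parent.
        parent = max(random.sample(list(population), sample_size), key=lambda ind: ind[0])
        child_arch = mutate(parent[1])
        child = (evaluate(child_arch), child_arch)
        population.append(child)
        population.popleft()       # aging: discard the oldest individual, not the worst
        if child[0] > best[0]:
            best = child
    return best

best_fitness, best_arch = regularized_evolution()
print(best_arch, round(best_fitness, 2))
```

Since each fitness evaluation is a single forward pass over one mini-batch, the cost of a search cycle is dominated by that pass rather than by training, which is the source of the GPU-minute search times reported above.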

Ablations: Batch Size, Input Dimension, Regularization

Batch size: For small CNNs, SWAP-Score's correlation with ground-truth performance decreases as batch size increases, due to pattern saturation, but for large-scale models (e.g., Transformers), batch size matters much less.
Input dimension: Higher input dimension enhances correlation for real data; using random noise as input destroys this correlation.
Regularization: Applying the SWAP-Score regularization function to baselines like FLOPs or parameter count can increase their correlation on some tasks, but regularized SWAP-Score consistently outperforms them, indicating synergistic benefit from both scale normalization and structural expressivity features.

Implications and Future Directions

On a practical level, SWAP-Score enables rapid, accurate, and scalable NAS, knowledge distillation, and architecture pruning across both vision and language domains. The label- and task-agnostic definition allows for true pre-training evaluation, offering early guidance even in workflows dominated by foundation models and pre-training.

Theoretically, SWAP-Score reframes network expressivity measurement, quantifying sample-wise activation diversity as a unifying metric across architectures and domains. The consistent, robust alignment with ground-truth performance indicates SWAP captures information neglected by prior convex- or kernel-based proxies.

Potential directions for future research include:

Extension of the indicator/binning function for more fine-grained (multi-level) activation analysis.
Study of SWAP-Score properties and data-dependence under varied pre-training datasets, initialization regimes, and architecture motifs.
Integration and analysis within larger multi-modal or multi-task networks, including guiding automated curriculum or architecture adaptation.

Conclusion

Sample-Wise Activation Patterns and SWAP-Score provide an effective, training-free, and highly generalizable tool for zero-shot neural network evaluation. The metric achieves superior correlation with ground-truth performance and strong domain generalization for both CNNs and Transformers across diverse downstream tasks. Its integration with evolutionary NAS yields rapid, state-of-the-art architecture discovery at minimal computational expense. SWAP-Score advances the field toward efficient, scalable, and universally applicable neural architecture selection (2605.07378).
