Blockwise NAS: Modular Neural Architecture Search
- Blockwise NAS is a modular strategy that partitions the neural architecture into independent blocks to tackle the exponential complexity of macro design spaces.
- It employs methods such as block-level distillation, ILP optimization, and combinatorial multi-armed bandits to efficiently evaluate and assemble optimal network architectures.
- Empirical studies show that blockwise NAS can achieve up to 10× higher efficiency and better Pareto frontiers compared to traditional cell-based searches, while enabling hardware-adaptive deployments.
Blockwise Neural Architecture Search (NAS) refers to a family of methodologies that decompose the neural architecture search problem by partitioning the overall network into a sequence of independently optimized computational "blocks." Each block is either a computational stage (e.g., inverted bottleneck, residual, transformer sublayer, or cell) or a larger module. Blockwise NAS methods are fundamentally motivated by the need to scale NAS to macro (heterogeneous) search spaces while preserving evaluation fidelity, computational tractability, and cross-hardware deployability. This approach redefines both the complexity landscape and the algorithmic paradigms of architecture search.
1. Blockwise NAS: Motivation and Search-Space Factorization
Blockwise NAS directly addresses the exponential complexity of macro network design by leveraging modular network decompositions. Instead of searching over the space of all possible architectures as monoliths, blockwise approaches factorize the search space:
$$\mathcal{A} \;=\; \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_B,$$
where $\mathcal{A}_i$ is the set of candidate sub-architectures for block $i$, and $B$ is the number of blocks (stages, layers, or nodes) (Li et al., 2019).
In contrast to classical cell-based NAS—which designs a single cell and stacks it repeatedly—blockwise NAS works in a macro search space where each block is independently chosen (with or without position-specific libraries), yielding $K^{B}$ possible networks for $B$ blocks (or stages) with $K$ candidates each (Chau et al., 2022). This structure enables (a) local evaluation of block candidates, (b) analytic or data-driven aggregation of block-wise metrics, and (c) use of mathematical programming, surrogate modeling, or staged search to efficiently traverse the exponentially large global architecture space (Qinsi et al., 2023).
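The factorized space grows multiplicatively in the per-block library sizes. A minimal sketch (all numbers illustrative, not from the cited papers) makes the contrast with cell-based spaces concrete:

```python
# Illustrative sketch: a macro (blockwise) space is the product of per-block
# library sizes, while a cell-based space collapses to one shared library.
from math import prod

def macro_space_size(library_sizes):
    """|A| = |A_1| * |A_2| * ... * |A_B| for B independently chosen blocks."""
    return prod(library_sizes)

def cell_space_size(cell_library_size):
    """Cell-based NAS picks one cell design and stacks it repeatedly,
    so the space collapses to the number of candidate cell designs."""
    return cell_library_size

# Example: 5 blocks, each with 8 candidate sub-architectures.
print(macro_space_size([8] * 5))  # 8**5 = 32768 candidate networks
print(cell_space_size(8))         # only 8 candidate networks
```

The exponential gap is exactly what motivates local (per-block) evaluation: probing all $K \cdot B$ block candidates is cheap even when enumerating $K^{B}$ networks is not.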
2. Methodologies for Blockwise NAS
Multiple algorithmic paradigms underpin blockwise NAS, each characterized by a distinct search, evaluation, and aggregation process.
2.1. Blockwise Distillation and Separate Block Training
Blockwise NAS can mitigate the unreliable ranking endemic to weight-sharing supernet approaches by independently training (distilling) each block. Each block is trained (optionally under knowledge distillation from a teacher model) to minimize a combined classification and distillation loss (Li et al., 2019, Chau et al., 2022):
$$\mathcal{L}_i \;=\; \alpha\,\mathcal{L}_{\text{CE}} \;+\; (1-\alpha)\,\big\lVert f_i(Y_{i-1}) - Y_i \big\rVert_2^2,$$
where $Y_{i-1}$ and $Y_i$ are the corresponding block's input and output activations from the teacher.
After per-block training, all block candidates can be fully evaluated, leading to more reliable block-level performance metrics and improved global network ranking fidelity (Li et al., 2019).
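As a toy numerical illustration of the per-block scheme (a hypothetical scalar "block", not the DNA implementation), each student block can be fit to the teacher's block output in isolation, using only that block's input/output pair:

```python
# Minimal sketch of independent per-block distillation (hypothetical setup):
# each candidate block is a scalar-weight map y = w * x, trained alone to
# match the teacher's output for the same block -- no other block is involved.

def distill_block(teacher_w, xs, steps=200, lr=0.05):
    """Fit a student weight w by gradient descent on the per-block MSE
    ||w*x - teacher_w*x||^2 over the sample inputs xs."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - teacher_w * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w

xs = [0.5, 1.0, 1.5, 2.0]
student_w = distill_block(teacher_w=3.0, xs=xs)
print(round(student_w, 3))  # converges toward the teacher weight 3.0
```

Because each block's loss depends only on the teacher's activations at that position, every candidate in every block library can be trained and scored fully in parallel, which is the source of the ranking-fidelity advantage over weight-sharing supernets.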
2.2. Mathematical Programming Aggregation
Instead of directly evaluating all $m^{n}$ architectures ($n$ blocks, $m$ candidates each), MathNAS estimates the performance contribution $\Delta P_{i,j}$ of each block candidate $b_{i,j}$ via single-swapping in a baseline or average-FLOPs network, then predicts any architecture's metrics by summing blockwise effects (Qinsi et al., 2023):
$$\hat{P}(a) \;=\; P_{\text{base}} \;+\; \sum_{i=1}^{n} \Delta P_{i,\,a_i}.$$
An integer linear program (ILP) is then solved over the binary selection variables $x_{i,j} \in \{0,1\}$, subject to hardware constraints, yielding globally optimal architectures in polynomial time (Qinsi et al., 2023). This divide-and-conquer transforms the intractable $O(m^{n})$ search into $O(mn)$ precomputed blockwise probes plus a fast combinatorial optimization.
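A simplified sketch of the additive aggregation (all numbers invented; the actual MathNAS formulation solves an ILP, whereas this toy brute-forces a 3-block, 3-candidate space):

```python
# Sketch of blockwise additive prediction and constrained selection.
# delta_acc[i][j] is the probed accuracy effect of candidate j in block i;
# latency[i][j] is its measured per-block latency. All values illustrative.
from itertools import product

base_acc = 70.0
delta_acc = [[0.0, 1.2, 0.8],   # block 0: candidates j = 0..2
             [0.0, 0.5, 1.5],   # block 1
             [0.0, 0.9, 0.4]]   # block 2
latency =   [[1.0, 2.0, 1.5],
             [1.0, 1.8, 2.5],
             [1.0, 1.2, 1.1]]

def predict(arch):
    """Additive prediction: P(a) ~= P_base + sum_i delta[i][a_i]."""
    return base_acc + sum(delta_acc[i][j] for i, j in enumerate(arch))

def best_under_budget(budget):
    """Pick the highest-predicted architecture whose summed latency fits."""
    feasible = (a for a in product(range(3), repeat=3)
                if sum(latency[i][j] for i, j in enumerate(a)) <= budget)
    return max(feasible, key=predict)

arch = best_under_budget(budget=5.0)
print(arch, predict(arch))
```

Only $m \cdot n = 9$ block probes were needed to score all $m^{n} = 27$ networks; in real spaces the gap between $mn$ and $m^{n}$ is what makes the method fast.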
2.3. Combinatorial Multi-Armed Bandits
CMAB-NAS models the search as a combinatorial multi-armed bandit. Rather than evaluate the monolithic global reward of a cell, the reward is decomposed into per-block (per-node) local rewards, enabling nested Monte-Carlo search with UCB-driven exploration over levels that correspond to each block/node (Huang et al., 2021). Under the naïve-additive-reward assumption, this factorization both accelerates exploration and permits tractable optimization over extremely large spaces.
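A toy sketch of the per-block bandit decomposition (illustrative, not the CMAB-NAS implementation): each block position keeps its own UCB1 statistics, and the global reward is assumed to decompose additively across blocks:

```python
# Per-block UCB1 over a 2-block, 2-arm toy space with additive rewards.
# TRUE_REWARD values and noise level are invented for illustration.
import math
import random

random.seed(0)
TRUE_REWARD = [[0.2, 0.8], [0.5, 0.3]]   # mean reward of each arm, per block

def ucb_pick(counts, values, t):
    """UCB1 arm choice for one block position."""
    for j, c in enumerate(counts):
        if c == 0:
            return j                      # play every arm once first
    return max(range(len(counts)),
               key=lambda j: values[j] + math.sqrt(2 * math.log(t) / counts[j]))

counts = [[0, 0], [0, 0]]
values = [[0.0, 0.0], [0.0, 0.0]]
for t in range(1, 501):
    arch = [ucb_pick(counts[i], values[i], t) for i in range(2)]
    for i, j in enumerate(arch):          # noisy additive per-block rewards
        r = TRUE_REWARD[i][j] + random.gauss(0, 0.05)
        counts[i][j] += 1
        values[i][j] += (r - values[i][j]) / counts[i][j]

best = [max(range(2), key=lambda j: values[i][j]) for i in range(2)]
print(best)  # should recover the best per-block arms [1, 0]
```

Because each block position runs its own bandit, the number of statistics grows linearly in blocks and arms rather than exponentially in their combinations, which is the point of the decomposition.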
2.4. Differentiable and Evolutionary Blockwise NAS
EG-DARTS performs block-level differentiable architecture search with a complexity-regularized, enhanced architecture gradient, followed by evolutionary multi-objective search over global hyperparameters (e.g., channel counts and block repetitions). This yields a blockwise design pipeline combining gradient-based search for blocks and a GA/NSGA-II multi-objective search for network-level assembly (Zhang et al., 2021).
3. Assembly and Evaluation of Blockwise Architectures
Blockwise NAS separates the search pipeline into block candidate construction/evaluation and global architecture assembly.
- In methods such as DONNA or LANA, blocks are distilled from a teacher, assigned per-block signatures (e.g., a distillation loss $\ell_i$ or validation accuracy drop $\Delta a_i$), and then blockwise assignments are jointly optimized with respect to a cost constraint via surrogate regression, ILP, or evolutionary search (Chau et al., 2022).
- MathNAS simply sums blockwise additive effect sizes for target metrics and optimizes the block selection using mathematical programming (Qinsi et al., 2023).
- Post-assembly, a final retraining or fine-tuning stage is typically performed, often with distillation, to restore end-to-end feature compatibility and maximize final validation accuracy (Chau et al., 2022, Li et al., 2019).
Cost constraints (FLOPs, latency, energy) can be directly encoded into the blockwise selection step and solved rapidly by ILP or evolutionary population-based methods (Qinsi et al., 2023, Zhang et al., 2021).
4. Search-Space Design: Macro vs. Cell-Based NAS
Blockwise NAS enables exploration of macro search spaces where each block and its connections may be heterogeneous. BLOX distinguishes macro (blockwise) search spaces from cell-based spaces, and demonstrates that macro search can achieve strictly better accuracy-vs.-compute Pareto frontiers, at the cost of exponentially larger search spaces (Chau et al., 2022). The use of block-level evaluation and surrogate/proxy modeling makes this large space tractable for practical NAS.
Macro search spaces also allow for richer block topologies (e.g., MBConv, BConv, multi-input/output DAGs), and can be parameterized to preserve computational feasibility—by, e.g., restricting the number of blocks, reducing library size, or applying Pareto filtering per block (Chau et al., 2022, Lu et al., 2024).
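Per-block Pareto filtering can be sketched as follows (a hypothetical filter over invented (accuracy-proxy, cost) pairs, not code from the cited works): before global assembly, any block candidate that another candidate dominates in both objectives is dropped from that block's library:

```python
# Hypothetical per-block Pareto filter: keep only candidates that are not
# dominated (another candidate with >= accuracy proxy and <= cost).

def pareto_filter(candidates):
    """candidates: list of (accuracy, cost); return the non-dominated ones."""
    keep = []
    for i, (acc_i, cost_i) in enumerate(candidates):
        dominated = any(
            acc_j >= acc_i and cost_j <= cost_i
            and (acc_j, cost_j) != (acc_i, cost_i)
            for j, (acc_j, cost_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            keep.append((acc_i, cost_i))
    return keep

blocks = [(0.71, 3.0), (0.70, 1.0), (0.74, 2.0), (0.69, 2.5)]
print(pareto_filter(blocks))  # (0.71, 3.0) and (0.69, 2.5) are dominated by (0.74, 2.0)
```

Shrinking each block library this way multiplies through the factorized space: halving every library size divides the global space by $2^{B}$.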
5. Hardware- and Deployment-Aware Blockwise NAS
Blockwise NAS is naturally compatible with hardware-adaptive and quantization-aware search strategies. The blockwise structure allows per-block lookup tables of accuracy, latency, and energy for each architectural and quantization choice (bitwidth), which can be used for fast Pareto-based global optimization under target constraints (Lu et al., 2024).
QA-BWNAS extends blockwise NAS to quantized and few-bit mixed-precision models for edge devices. Per-block training followed by per-block post-training quantization and LUT population leads to rapid, accurate estimation of network-wide accuracy and hardware cost, enabling sub-second search times on tasks like Cityscapes segmentation, and yielding models that dominate prior baselines in latency and model size at matched accuracy (Lu et al., 2024, Qinsi et al., 2023).
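The LUT-driven selection can be illustrated with a small sketch (the table values are invented, and the real QA-BWNAS pipeline populates such tables from measured post-training-quantization results): for each block, a lookup table maps a candidate bitwidth to an estimated accuracy drop and latency, and the search picks per-block bitwidths minimizing total latency while keeping the summed accuracy drop under a budget:

```python
# Illustrative per-block lookup-table search for mixed precision.
# LUT[i] maps bitwidth -> (estimated accuracy drop, latency in ms) for block i.
from itertools import product

LUT = [
    {4: (0.9, 1.0), 6: (0.4, 1.4), 8: (0.1, 2.0)},
    {4: (1.5, 0.8), 6: (0.6, 1.1), 8: (0.2, 1.6)},
]

def search(max_drop):
    """Exhaustive LUT search: minimize total latency subject to a cap on the
    additively-predicted accuracy drop. Returns (bitwidths, latency) or None."""
    best = None
    for bits in product(*(lut.keys() for lut in LUT)):
        drop = sum(LUT[i][b][0] for i, b in enumerate(bits))
        lat = sum(LUT[i][b][1] for i, b in enumerate(bits))
        if drop <= max_drop and (best is None or lat < best[1]):
            best = (bits, lat)
    return best

print(search(max_drop=1.0))
```

With these illustrative numbers the search selects 6-bit precision for both blocks; because the tables are precomputed, re-running the search for a new hardware budget costs only the (small) enumeration or ILP solve, which is what enables sub-second adaptation.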
6. Empirical Findings and Practical Recommendations
Empirically, blockwise NAS offers dramatic improvements in search efficiency and Pareto optimality over conventional (cell-based or monolithic) NAS methods:
- On the CIFAR-100 macro search space, blockwise methods achieve roughly 10× higher search efficiency—matching or exceeding accuracy at up to 10× less compute—and identify better (lower-cost, higher-accuracy) Pareto frontiers than cell-based NAS (Chau et al., 2022).
- In large-scale search spaces (MobileNetV3, SuperViT, Transformer), MathNAS and similar frameworks find SOTA architectures in seconds instead of GPU-weeks, with blockwise additive predictions achieving high Spearman rank correlation with true accuracy (Qinsi et al., 2023).
- QA-BWNAS produces quantized models (INT8 or mixed 4/6/8-bit) that are significantly smaller and/or faster than post-quantized teachers, matching or exceeding their task performance while supporting instant adaptation to hardware constraints (Lu et al., 2024).
- The use of a high-quality teacher is essential for blockwise distillation-based search, and short proxy fine-tuning enables efficient candidate filtering (Chau et al., 2022, Li et al., 2019).
Best practices for blockwise NAS include: moderate block library size and number of stages; strong teacher selection; using per-block distillation or signature surrogates; proxy fine-tuning filtering; and multi-objective or ILP-based selection for hardware/resource constraints (Chau et al., 2022, Qinsi et al., 2023).
7. Limitations and Open Questions
While blockwise NAS provides major advances, critical limitations and research questions remain:
- The reliability of blockwise proxies and their ability to predict context-sensitive interactions rest on empirical rather than theoretical justification. Surrogate models and block signatures show diminishing correlation in later network stages (Chau et al., 2022).
- Additive or independent block contributions may not fully capture nonlocal dependencies or emergent global behaviors, though empirical results demonstrate high predictive accuracy for many tasks (Qinsi et al., 2023).
- The choice of block granularity and library composition demands careful balance: too fine a granularity reintroduces supernet-style challenges, while too coarse a granularity explodes search cost.
- For hardware-aware search, per-block latency or energy must be profiled on target platforms, and cost models require regular maintenance and validation (Lu et al., 2024).
A plausible implication is that the interplay between blockwise modularity, surrogate modeling, and cross-stage interaction will remain a central research theme for scalable, deployable neural architecture search.