Blockwise NAS: Modular Neural Architecture Search
- Blockwise NAS is a modular strategy that partitions the neural architecture into independent blocks to tackle the exponential complexity of macro design spaces.
- It employs methods such as block-level distillation, ILP optimization, and combinatorial multi-armed bandits to efficiently evaluate and assemble optimal network architectures.
- Empirical studies show that blockwise NAS can achieve up to 10× higher efficiency and better Pareto frontiers compared to traditional cell-based searches, while enabling hardware-adaptive deployments.
Blockwise Neural Architecture Search (NAS) refers to a family of methodologies that decompose the neural architecture search problem by partitioning the overall network into a sequence of independently optimized computational "blocks." Each block is either a computational stage (e.g., inverted bottleneck, residual, transformer sublayer, or cell) or a larger module. Blockwise NAS methods are fundamentally motivated by the need to scale NAS to macro (heterogeneous) search spaces while preserving evaluation fidelity, computational tractability, and cross-hardware deployability. This approach redefines both the complexity landscape and the algorithmic paradigms of architecture search.
1. Blockwise NAS: Motivation and Search-Space Factorization
Blockwise NAS directly addresses the exponential complexity of macro network design by leveraging modular network decompositions. Instead of searching over the space of all possible architectures as monoliths, blockwise approaches factorize the search space:
$$\mathcal{A} \;=\; \mathcal{A}_1 \times \mathcal{A}_2 \times \cdots \times \mathcal{A}_B,$$
where $\mathcal{A}_i$ is the set of candidate sub-architectures for block $i$, and $B$ is the number of blocks (stages, layers, or nodes) (Li et al., 2019).
In contrast to classical cell-based NAS—which designs a single cell and stacks it repeatedly—blockwise NAS works in a macro search space where each block is independently chosen (with or without position-specific libraries), yielding $K^{B}$ possible networks for $B$ blocks (or stages) with $K$ candidates each (Chau et al., 2022). This structure enables (a) local evaluation of block candidates, (b) analytic or data-driven aggregation of block-wise metrics, and (c) use of mathematical programming, surrogate modeling, or staged search to efficiently traverse the exponentially large global architecture space (Qinsi et al., 2023).
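The factorized space grows multiplicatively in the per-block library sizes. A minimal sketch (all numbers illustrative, not from the cited papers) makes the contrast with cell-based spaces concrete:

```python
# Illustrative sketch: a macro (blockwise) space is the product of per-block
# library sizes, while a cell-based space collapses to one shared library.
from math import prod

def macro_space_size(library_sizes):
    """|A| = |A_1| * |A_2| * ... * |A_B| for B independently chosen blocks."""
    return prod(library_sizes)

def cell_space_size(cell_library_size):
    """Cell-based NAS picks one cell design and stacks it repeatedly,
    so the space collapses to the number of candidate cell designs."""
    return cell_library_size

# Example: 5 blocks, each with 8 candidate sub-architectures.
print(macro_space_size([8] * 5))  # 8**5 = 32768 candidate networks
print(cell_space_size(8))         # only 8 candidate networks
```

The exponential gap is exactly what motivates local (per-block) evaluation: probing all $K \cdot B$ block candidates is cheap even when enumerating $K^{B}$ networks is not.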
2. Methodologies for Blockwise NAS
Multiple algorithmic paradigms underpin blockwise NAS, each characterized by a distinct search, evaluation, and aggregation process.
2.1. Blockwise Distillation and Separate Block Training
Blockwise NAS can mitigate the unreliable ranking endemic to weight-sharing supernet approaches by independently training (distilling) each block. Each block is trained (optionally under knowledge distillation from a teacher model) to minimize a combined classification and distillation loss (Li et al., 2019, Chau et al., 2022):
$$\mathcal{L}_i \;=\; \alpha\,\mathcal{L}_{\text{CE}} \;+\; (1-\alpha)\,\big\lVert f_i(Y_{i-1}) - Y_i \big\rVert_2^2,$$
where $Y_{i-1}$ and $Y_i$ are the corresponding block's input and output activations from the teacher.
After per-block training, all block candidates can be fully evaluated, leading to more reliable block-level performance metrics and improved global network ranking fidelity (Li et al., 2019).
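As a toy numerical illustration of the per-block scheme (a hypothetical scalar "block", not the DNA implementation), each student block can be fit to the teacher's block output in isolation, using only that block's input/output pair:

```python
# Minimal sketch of independent per-block distillation (hypothetical setup):
# each candidate block is a scalar-weight map y = w * x, trained alone to
# match the teacher's output for the same block -- no other block is involved.

def distill_block(teacher_w, xs, steps=200, lr=0.05):
    """Fit a student weight w by gradient descent on the per-block MSE
    ||w*x - teacher_w*x||^2 over the sample inputs xs."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * x - teacher_w * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w

xs = [0.5, 1.0, 1.5, 2.0]
student_w = distill_block(teacher_w=3.0, xs=xs)
print(round(student_w, 3))  # converges toward the teacher weight 3.0
```

Because each block's loss depends only on the teacher's activations at that position, every candidate in every block library can be trained and scored fully in parallel, which is the source of the ranking-fidelity advantage over weight-sharing supernets.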
2.2. Mathematical Programming Aggregation
Instead of directly evaluating all $m^{n}$ architectures ($n$ blocks, $m$ candidates each), MathNAS estimates the performance contribution $\Delta P_{i,j}$ of each block candidate $b_{i,j}$ via single-swapping in a baseline or average-FLOPs network, then predicts any architecture's metrics by summing blockwise effects (Qinsi et al., 2023):
$$\hat{P}(a) \;=\; P_{\text{base}} \;+\; \sum_{i=1}^{n} \Delta P_{i,\,a_i}.$$
An integer linear program (ILP) is then solved over the binary selection variables $x_{i,j} \in \{0,1\}$, subject to hardware constraints, yielding globally optimal architectures in polynomial time (Qinsi et al., 2023). This divide-and-conquer transforms the intractable $O(m^{n})$ search into $O(mn)$ precomputed blockwise probes plus a fast combinatorial optimization.
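A simplified sketch of the additive aggregation (all numbers invented; the actual MathNAS formulation solves an ILP, whereas this toy brute-forces a 3-block, 3-candidate space):

```python
# Sketch of blockwise additive prediction and constrained selection.
# delta_acc[i][j] is the probed accuracy effect of candidate j in block i;
# latency[i][j] is its measured per-block latency. All values illustrative.
from itertools import product

base_acc = 70.0
delta_acc = [[0.0, 1.2, 0.8],   # block 0: candidates j = 0..2
             [0.0, 0.5, 1.5],   # block 1
             [0.0, 0.9, 0.4]]   # block 2
latency =   [[1.0, 2.0, 1.5],
             [1.0, 1.8, 2.5],
             [1.0, 1.2, 1.1]]

def predict(arch):
    """Additive prediction: P(a) ~= P_base + sum_i delta[i][a_i]."""
    return base_acc + sum(delta_acc[i][j] for i, j in enumerate(arch))

def best_under_budget(budget):
    """Pick the highest-predicted architecture whose summed latency fits."""
    feasible = (a for a in product(range(3), repeat=3)
                if sum(latency[i][j] for i, j in enumerate(a)) <= budget)
    return max(feasible, key=predict)

arch = best_under_budget(budget=5.0)
print(arch, predict(arch))
```

Only $m \cdot n = 9$ block probes were needed to score all $m^{n} = 27$ networks; in real spaces the gap between $mn$ and $m^{n}$ is what makes the method fast.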
2.3. Combinatorial Multi-Armed Bandits
CMAB-NAS models the search as a combinatorial multi-armed bandit. Rather than evaluate the monolithic global reward of a cell, the reward is decomposed into per-block (per-node) local rewards, enabling nested Monte-Carlo search with UCB-driven exploration over levels that correspond to each block/node (Huang et al., 2021). Under the naïve-additive-reward assumption, this factorization both accelerates exploration and permits tractable optimization over extremely large spaces.
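A toy sketch of the per-block bandit decomposition (illustrative, not the CMAB-NAS implementation): each block position keeps its own UCB1 statistics, and the global reward is assumed to decompose additively across blocks:

```python
# Per-block UCB1 over a 2-block, 2-arm toy space with additive rewards.
# TRUE_REWARD values and noise level are invented for illustration.
import math
import random

random.seed(0)
TRUE_REWARD = [[0.2, 0.8], [0.5, 0.3]]   # mean reward of each arm, per block

def ucb_pick(counts, values, t):
    """UCB1 arm choice for one block position."""
    for j, c in enumerate(counts):
        if c == 0:
            return j                      # play every arm once first
    return max(range(len(counts)),
               key=lambda j: values[j] + math.sqrt(2 * math.log(t) / counts[j]))

counts = [[0, 0], [0, 0]]
values = [[0.0, 0.0], [0.0, 0.0]]
for t in range(1, 501):
    arch = [ucb_pick(counts[i], values[i], t) for i in range(2)]
    for i, j in enumerate(arch):          # noisy additive per-block rewards
        r = TRUE_REWARD[i][j] + random.gauss(0, 0.05)
        counts[i][j] += 1
        values[i][j] += (r - values[i][j]) / counts[i][j]

best = [max(range(2), key=lambda j: values[i][j]) for i in range(2)]
print(best)  # should recover the best per-block arms [1, 0]
```

Because each block position runs its own bandit, the number of statistics grows linearly in blocks and arms rather than exponentially in their combinations, which is the point of the decomposition.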
2.4. Differentiable and Evolutionary Blockwise NAS
EG-DARTS performs block-level differentiable architecture search with a complexity-regularized, enhanced architecture gradient, followed by evolutionary multi-objective search over global hyperparameters (e.g., channel counts and block repetitions). This yields a blockwise design pipeline combining gradient-based search for blocks and a GA/NSGA-II multi-objective search for network-level assembly (Zhang et al., 2021).
3. Assembly and Evaluation of Blockwise Architectures
Blockwise NAS separates the search pipeline into block candidate construction/evaluation and global architecture assembly.
- In methods such as DONNA or LANA, blocks are distilled from a teacher, assigned per-block signatures (e.g., a distillation loss $\ell_i$ or validation accuracy drop $\Delta a_i$), and then blockwise assignments are jointly optimized with respect to a cost constraint via surrogate regression, ILP, or evolutionary search (Chau et al., 2022).
- MathNAS simply sums blockwise additive effect sizes for target metrics and optimizes the block selection using mathematical programming (Qinsi et al., 2023).
- Post-assembly, a final retraining or fine-tuning stage is typically performed, often with distillation, to restore end-to-end feature compatibility and maximize final validation accuracy (Chau et al., 2022, Li et al., 2019).
Cost constraints (FLOPs, latency, energy) can be directly encoded into the blockwise selection step and solved rapidly by ILP or evolutionary population-based methods (Qinsi et al., 2023, Zhang et al., 2021).
4. Search-Space Design: Macro vs. Cell-Based NAS
Blockwise NAS enables exploration of macro search spaces where each block and its connections may be heterogeneous. BLOX distinguishes macro (blockwise) search spaces from cell-based spaces, and demonstrates that macro search can achieve strictly better accuracy-vs.-compute Pareto frontiers, at the cost of exponentially larger search spaces (Chau et al., 2022). The use of block-level evaluation and surrogate/proxy modeling makes this large space tractable for practical NAS.
Macro search spaces also allow for richer block topologies (e.g., MBConv, BConv, multi-input/output DAGs), and can be parameterized to preserve computational feasibility—by, e.g., restricting the number of blocks, reducing library size, or applying Pareto filtering per block (Chau et al., 2022, Lu et al., 2024).
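Per-block Pareto filtering can be sketched as follows (a hypothetical filter over invented (accuracy-proxy, cost) pairs, not code from the cited works): before global assembly, any block candidate that another candidate dominates in both objectives is dropped from that block's library:

```python
# Hypothetical per-block Pareto filter: keep only candidates that are not
# dominated (another candidate with >= accuracy proxy and <= cost).

def pareto_filter(candidates):
    """candidates: list of (accuracy, cost); return the non-dominated ones."""
    keep = []
    for i, (acc_i, cost_i) in enumerate(candidates):
        dominated = any(
            acc_j >= acc_i and cost_j <= cost_i
            and (acc_j, cost_j) != (acc_i, cost_i)
            for j, (acc_j, cost_j) in enumerate(candidates) if j != i
        )
        if not dominated:
            keep.append((acc_i, cost_i))
    return keep

blocks = [(0.71, 3.0), (0.70, 1.0), (0.74, 2.0), (0.69, 2.5)]
print(pareto_filter(blocks))  # (0.71, 3.0) and (0.69, 2.5) are dominated by (0.74, 2.0)
```

Shrinking each block library this way multiplies through the factorized space: halving every library size divides the global space by $2^{B}$.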
5. Hardware- and Deployment-Aware Blockwise NAS
Blockwise NAS is naturally compatible with hardware-adaptive and quantization-aware search strategies. The blockwise structure allows per-block lookup tables of accuracy, latency, and energy for each architectural and quantization choice (bitwidth), which can be used for fast Pareto-based global optimization under target constraints (Lu et al., 2024).
QA-BWNAS extends blockwise NAS to quantized and few-bit mixed-precision models for edge devices. Per-block training followed by per-block post-training quantization and LUT population leads to rapid, accurate estimation of network-wide accuracy and hardware cost, enabling sub-second search times on tasks like Cityscapes segmentation, and yielding models that dominate prior baselines in latency and model size at matched accuracy (Lu et al., 2024, Qinsi et al., 2023).
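The LUT-driven selection can be illustrated with a small sketch (the table values are invented, and the real QA-BWNAS pipeline populates such tables from measured post-training-quantization results): for each block, a lookup table maps a candidate bitwidth to an estimated accuracy drop and latency, and the search picks per-block bitwidths minimizing total latency while keeping the summed accuracy drop under a budget:

```python
# Illustrative per-block lookup-table search for mixed precision.
# LUT[i] maps bitwidth -> (estimated accuracy drop, latency in ms) for block i.
from itertools import product

LUT = [
    {4: (0.9, 1.0), 6: (0.4, 1.4), 8: (0.1, 2.0)},
    {4: (1.5, 0.8), 6: (0.6, 1.1), 8: (0.2, 1.6)},
]

def search(max_drop):
    """Exhaustive LUT search: minimize total latency subject to a cap on the
    additively-predicted accuracy drop. Returns (bitwidths, latency) or None."""
    best = None
    for bits in product(*(lut.keys() for lut in LUT)):
        drop = sum(LUT[i][b][0] for i, b in enumerate(bits))
        lat = sum(LUT[i][b][1] for i, b in enumerate(bits))
        if drop <= max_drop and (best is None or lat < best[1]):
            best = (bits, lat)
    return best

print(search(max_drop=1.0))
```

With these illustrative numbers the search selects 6-bit precision for both blocks; because the tables are precomputed, re-running the search for a new hardware budget costs only the (small) enumeration or ILP solve, which is what enables sub-second adaptation.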
6. Empirical Findings and Practical Recommendations
Empirically, blockwise NAS offers dramatic improvements in search efficiency and Pareto optimality over conventional (cell-based or monolithic) NAS methods:
- On the CIFAR-100 macro search space, blockwise methods achieve roughly 10× higher search efficiency—matching or exceeding accuracy at up to 10× less compute—and identify better (lower-cost, higher-accuracy) Pareto frontiers than cell-based NAS (Chau et al., 2022).
- In large-scale search spaces (MobileNetV3, SuperViT, Transformer), MathNAS and similar frameworks find SOTA architectures in seconds instead of GPU-weeks, with blockwise additive predictions achieving high Spearman rank correlation with true accuracy (Qinsi et al., 2023).
- QA-BWNAS produces quantized models (INT8 or mixed 4/6/8-bit) that are significantly smaller and/or faster than post-quantized teachers, matching or exceeding their task performance while supporting instant adaptation to hardware constraints (Lu et al., 2024).
- The use of a high-quality teacher is essential for blockwise distillation-based search, and short proxy fine-tuning enables efficient candidate filtering (Chau et al., 2022, Li et al., 2019).
Best practices for blockwise NAS include: moderate block library size and number of stages; strong teacher selection; using per-block distillation or signature surrogates; proxy fine-tuning filtering; and multi-objective or ILP-based selection for hardware/resource constraints (Chau et al., 2022, Qinsi et al., 2023).
7. Limitations and Open Questions
While blockwise NAS provides major advances, critical limitations and research questions remain:
- The reliability of blockwise proxies and their ability to predict context-sensitive interactions rest on empirical rather than theoretical justification. Surrogate models and block signatures show diminishing correlation in later network stages (Chau et al., 2022).
- Additive or independent block contributions may not fully capture nonlocal dependencies or emergent global behaviors, though empirical results demonstrate high predictive accuracy for many tasks (Qinsi et al., 2023).
- The choice of block granularity and library composition demands careful balance: too fine a granularity reintroduces supernet-style challenges, while too coarse a granularity explodes search cost.
- For hardware-aware search, per-block latency or energy must be profiled on target platforms, and cost models require regular maintenance and validation (Lu et al., 2024).
A plausible implication is that the interplay between blockwise modularity, surrogate modeling, and cross-stage interaction will remain a central research theme for scalable, deployable neural architecture search.