Structural-to-Modular NAS
- The paper introduces a two-stage SM-NAS process that decomposes architecture search into structural and modular phases to discover Pareto-efficient detector models.
- It leverages multi-objective evolutionary algorithms and GPU-aware strategies, optimizing both accuracy (e.g., mAP) and computational cost (e.g., FLOPs, latency).
- Empirical results show that SM-NAS outperforms hand-crafted and prior NAS-designed detectors by jointly searching module combinations and their internal configurations, improving performance in object detection and, via related extensions, classification.
Structural-to-Modular Neural Architecture Search (SM-NAS) denotes a class of neural architecture search (NAS) methodologies in which the search process is decomposed into a structural-level phase (which selects architectural module combinations and high-level structural patterns) and a modular-level phase (which optimizes the fine-grained configuration of each module). SM-NAS approaches were motivated by the observation that state-of-the-art deep learning models, particularly in computer vision tasks such as object detection, are complex compositions of multiple functionally distinct modules: e.g., a feature-extraction backbone, a neck (such as an FPN), a region proposal network (RPN), and a detection head. These modules admit a combinatorial variety of configurations, each yielding different trade-offs between computational cost and task accuracy. SM-NAS thus formally addresses both the structural (macro) and modular (micro) aspects of neural architecture design, targeting Pareto-efficient solutions on the empirical latency–accuracy plane (Yao et al., 2019).
1. Motivation and Problem Definition
In classical NAS, the focus is often on searching for improved architectures of individual modules (e.g., the backbone, feature fusion neck) while treating the remaining system structure as fixed. This myopic perspective neglects significant global trade-offs introduced by varying the combination of modules and input resolutions. SM-NAS introduces a principled multi-objective search targeting the simultaneous optimization of architecture for both individual modules and their inter-module structure, driven by explicit hardware-specific constraints such as inference latency or floating-point operation counts (FLOPs).
Formally, SM-NAS seeks solutions to the following multi-objective optimization:

$$\max_{m \in \mathcal{S}} \;\big(\mathrm{Acc}(m),\, -\mathrm{Cost}(m)\big),$$

where $\mathrm{Acc}(m)$ denotes detection accuracy (e.g., mAP on COCO) and $\mathrm{Cost}(m)$ is either empirical GPU inference time or analytically computed FLOPs. Pareto dominance is utilized: a candidate $m_1$ dominates $m_2$ if $\mathrm{Acc}(m_1) \ge \mathrm{Acc}(m_2)$ and $\mathrm{Cost}(m_1) \le \mathrm{Cost}(m_2)$, with at least one inequality strict (Yao et al., 2019).
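As a concrete illustration of this dominance criterion, the following minimal Python sketch filters a candidate population down to its non-dominated (Pareto) set; the `Candidate`, `dominates`, and `pareto_front` names are illustrative and not taken from the SM-NAS codebase.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    acc: float   # accuracy objective, e.g. COCO mAP (higher is better)
    cost: float  # cost objective, e.g. latency in ms or GFLOPs (lower is better)

def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.acc >= b.acc and a.cost <= b.cost
            and (a.acc > b.acc or a.cost < b.cost))

def pareto_front(population: list[Candidate]) -> list[Candidate]:
    """Keep only candidates that no other candidate dominates."""
    return [c for c in population
            if not any(dominates(o, c) for o in population if o is not c)]

# Example: the middle candidate is dominated and drops out of the front.
pool = [Candidate("A", 27.1, 24.5), Candidate("B", 26.0, 30.0), Candidate("C", 40.1, 39.5)]
print([c.name for c in pareto_front(pool)])  # ['A', 'C']
```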
2. Two-Stage Coarse-to-Fine Search Strategy
SM-NAS frameworks, such as in "SM-NAS: Structural-to-Modular Neural Architecture Search for Object Detection" (Yao et al., 2019), operationalize this principle through a two-stage, coarse-to-fine search process:
- Stage 1 (Structural-level): The search space is the Cartesian product of:
- Backbone networks (e.g., {ResNet18, ResNet34, ResNet50, ResNet101, ResNeXt50, ResNeXt101, MobileNetV2}),
- Neck variants (e.g., {no-FPN, FPN(P₂–P₄), ..., FPN(P₁–P₆)}),
- Region Proposal Network types (e.g., {no-RPN, RPN, GA-RPN}),
- Head types (e.g., {RetinaNet-head, 2FC-head, Cascade-head(n)}),
- Input resolutions ({512×512, ..., 1333×800}). Each candidate is thus a tuple of (backbone, neck, RPN, head, input resolution) choices.
- Stage 2 (Modular-level): For each Pareto-optimal structure from Stage 1, the search space opens the internal configuration of modules. For the backbone, for example, this means parameterizing by block type, base channel width, number of blocks in each of the 5 stages, and channel-doubling flags. The neck's output channels are also exposed. The total search space thus becomes a high-dimensional set of "width-depth" configurations for each structural setting.
The output is a refined Pareto front of architectures spanning low to high computational cost, each optimized for its resource regime (Yao et al., 2019).
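The coarse-to-fine decomposition can be sketched as follows: Stage 1 samples from a Cartesian product of module choices, and Stage 2 refines a chosen structure by exposing per-module widths and depths. The module lists below mirror those above, but the helper functions and the particular value ranges are illustrative assumptions rather than the paper's exact search-space parameterization.

```python
import random

# Stage 1 (structural): the search space is the Cartesian product of module choices.
BACKBONES = ["ResNet18", "ResNet34", "ResNet50", "ResNet101",
             "ResNeXt50", "ResNeXt101", "MobileNetV2"]
NECKS = ["no-FPN", "FPN(P2-P4)", "FPN(P2-P5)", "FPN(P1-P6)"]
RPNS = ["no-RPN", "RPN", "GA-RPN"]
HEADS = ["RetinaNet-head", "2FC-head", "Cascade-head(3)"]
RESOLUTIONS = [(512, 512), (800, 600), (1333, 800)]

def sample_structure():
    """Draw one Stage-1 candidate: a tuple of module choices plus input resolution."""
    return (random.choice(BACKBONES), random.choice(NECKS), random.choice(RPNS),
            random.choice(HEADS), random.choice(RESOLUTIONS))

def sample_modular_config(structure, n_stages=5):
    """Draw one Stage-2 refinement for a fixed structure: base width, per-stage
    depths, channel-doubling flags, and neck output channels (value ranges are
    illustrative placeholders)."""
    return {
        "structure": structure,
        "base_width": random.choice([32, 48, 56, 64]),
        "depths": [random.randint(1, 4) for _ in range(n_stages)],
        "double_channels": [random.random() < 0.5 for _ in range(n_stages)],
        "neck_channels": random.choice([64, 128, 256]),
    }

print(sample_modular_config(sample_structure()))
```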
3. Multi-Objective Evolutionary Search and GPU-Aware Optimization
SM-NAS implements an evolutionary algorithm to traverse these large search spaces. The search proceeds as follows:
- Population: A set of candidate architectures is sampled at random from the search space.
- Selection and Mutation: At each generation, a parent is selected from the current Pareto front and mutated to produce a child: by swapping modules in Stage 1, or by perturbing module depths, widths, or neck channels in Stage 2.
- Partial Order Pruning (POP): Mutations producing strictly deeper and wider candidates (inevitably worse in cost) are pruned without evaluation.
- Fitness Evaluation: Each candidate is rapidly trained for a fixed (low) number of epochs using dedicated strategies (see below), and its (Acc, Cost) pair is measured. In Stage 1, actual V100 inference time is used as the cost; in Stage 2, FLOPs are used instead to avoid latency-measurement noise.
- Pareto Sorting: Non-dominated sorting is used to maintain a current Pareto front. Offspring replace less efficient candidates as the population evolves.
This procedure yields a family of architectures covering the optimal trade-off surface, suitable for selection given any task or hardware constraint (Yao et al., 2019).
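A simplified sketch of this loop is given below; it reuses the `pareto_front` helper from the dominance sketch in Section 1, and the `mutate`, `is_strictly_larger`, and `fast_train_and_eval` callables are placeholders for the paper's module-swapping mutations, Partial Order Pruning test, and short-schedule evaluation.

```python
import random

def evolve(initial_population, n_generations, mutate, is_strictly_larger,
           fast_train_and_eval):
    """Multi-objective evolutionary search over architectures (schematic).

    Candidates carry (acc, cost) attributes after evaluation, so the
    pareto_front helper from Section 1 applies directly.
    """
    population = [fast_train_and_eval(c) for c in initial_population]
    front = pareto_front(population)
    for _ in range(n_generations):
        parent = random.choice(front)       # select a parent on the current front
        child = mutate(parent)              # swap modules (Stage 1) or perturb
                                            # depths/widths/neck channels (Stage 2)
        # Partial Order Pruning: children strictly deeper and wider than an
        # already-seen candidate are discarded without training, since they
        # can only be more expensive.
        if any(is_strictly_larger(child, seen) for seen in population):
            continue
        child = fast_train_and_eval(child)  # short-schedule training; measure acc and cost
        population.append(child)
        front = pareto_front(population)    # non-dominated sorting keeps the front current
    return front
```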
4. Fast Training from Scratch: Normalization, Weight Standardization, and Evaluation Efficiency
To keep the search computationally feasible, especially at the modular level, SM-NAS abandons expensive pretraining strategies. Instead:
- Group Normalization (GN): Used in lieu of BatchNorm, which is ineffective at the very small batch sizes typical in NAS searches (Wu & He, 2018).
- Weight Standardization (WS): Applied to all convolutional weights, which empirically smooths the loss landscape and accelerates convergence (Qiao et al., 2019).
- Aggressive Learning Rate and Scheduling: High base learning rate (e.g., 0.24) with cosine decay, and very small batch sizes (8 per GPU).
- Lightweight Augmentation: Basic flip and scale jitter.

Empirical ablation (see Table 4 in Yao et al., 2019) shows that GN+WS, combined with the above recipe, closes the gap between training from scratch and ImageNet-pretrained baselines within half the standard training epochs, making exhaustive search over modular configurations tractable.
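A minimal PyTorch-style sketch of the GN+WS recipe is shown below, assuming only standard `torch` APIs; it illustrates how Weight Standardization and GroupNorm replace a Conv-BN block when training from scratch with small batches, and is not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    """Conv2d with Weight Standardization: kernel weights are re-centered and
    re-scaled per output channel before every forward pass."""
    def forward(self, x):
        w = self.weight
        mean = w.mean(dim=(1, 2, 3), keepdim=True)
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        return F.conv2d(x, (w - mean) / std, self.bias,
                        self.stride, self.padding, self.dilation, self.groups)

def conv_gn_ws(in_ch, out_ch, stride=1, num_groups=32):
    """A 3x3 WS-conv + GroupNorm + ReLU block, replacing the usual Conv-BN-ReLU
    when batches are too small for BatchNorm (num_groups must divide out_ch)."""
    return nn.Sequential(
        WSConv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1, bias=False),
        nn.GroupNorm(num_groups, out_ch),
        nn.ReLU(inplace=True),
    )

# Example: a small stem that can be trained from scratch with batch size 8 per GPU.
stem = nn.Sequential(conv_gn_ws(3, 64), conv_gn_ws(64, 128, stride=2))
print(stem(torch.randn(2, 3, 64, 64)).shape)  # torch.Size([2, 128, 32, 32])
```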
5. Empirical Results and Benchmarking
SM-NAS discovers six reference detector architectures ("E0" through "E5"), each characterized by a unique combination of backbone, neck, RPN, head, and input resolution, together with precise modular configurations. These models achieve the following:
- E0 achieves 27.1 mAP at 24.5 ms latency with 7.2 Giga-FLOPs.
- E2 matches or outperforms FPN–ResNet50 while halving inference time: 40.1 mAP at 39.5 ms (Yao et al., 2019).
- E5 reaches 46.1 mAP at 108.1 ms, matching the inference time of Mask R-CNN but with higher AP.
Comparison against state-of-the-art detectors (including Faster R-CNN, RetinaNet, NAS-FPN) on COCO test-dev demonstrates that SM-NAS architectures span the accuracy-latency Pareto front, outperforming hand-designed and previous NAS-crafted detectors across multiple regimes. Transferability is validated by observed improvements on alternate datasets such as VOC and BDD (Yao et al., 2019).
A summary table (abbreviated):
| Model | Input | Backbone | Neck | RPN | Head | FLOPs (G) | Time (ms) | mAP |
|---|---|---|---|---|---|---|---|---|
| E0 | 512×512 | basicblock_64_1–21–21–12 | FPN(P₂–P₅, c=128) | RPN | 2FC | 7.2 | 24.5 | 27.1 |
| E2 | 800×600 | basicblock_48_12–...–1112 | FPN(P₁–P₅, c=128) | RPN | Cascade(n=3) | 23.8 | 39.5 | 40.1 |
| E5 | 1333×800 | Xbottleneck_56_21–... | FPN(P₁–P₅, c=256) | GA-RPN | Cascade(n=3) | 162.5 | 108.1 | 46.1 |
6. Variants and Generalizations
The SM-NAS pattern is not limited to detection. The ModuleNet framework ("ModuleNet: Knowledge-inherited Neural Architecture Search" (Chen et al., 2020)) extends SM-NAS to classification by defining a knowledge base of network "cells" (modules) extracted from pretrained networks. The search space is all recombinations of these modules at consistent positions, and architectures are searched by NSGA-II. No retraining of convolutional modules occurs during the search: only small classifier layers are fine-tuned, dramatically cutting compute costs. Candidate architectures are scored by a metric combining validation error, loss decrease rate, and architectural similarity to prior nets. ModuleNet achieves substantial gains over ResNet/VGG baselines and prior NAS cell-based methods, with test error on CIFAR-100 reduced to 15.87% when using ModuleNet with cutout, compared to 22.97% for the best ResNet plus cutout. Results are similarly strong for CIFAR-10 and transfer to ImageNet (Chen et al., 2020).
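The knowledge-inheritance idea can be sketched as follows: inherited modules are frozen and recombined, and only a lightweight classifier is trained during candidate evaluation. Class and function names here are illustrative and do not correspond to the ModuleNet codebase; the score shown is a plain accuracy proxy rather than the paper's composite metric.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InheritedNet(nn.Module):
    """A candidate assembled from frozen, pretrained modules plus a new classifier."""
    def __init__(self, inherited_modules, feat_dim, num_classes):
        super().__init__()
        self.features = nn.Sequential(*inherited_modules)
        for p in self.features.parameters():
            p.requires_grad = False            # inherited weights are never retrained
        self.classifier = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        f = self.features(x)
        f = torch.flatten(F.adaptive_avg_pool2d(f, 1), 1)
        return self.classifier(f)

def quick_score(candidate, loader, device="cuda", steps=100):
    """Cheap fitness proxy: fine-tune only the classifier for a few steps and
    return running accuracy (a stand-in for the paper's composite score)."""
    candidate.to(device).train()
    opt = torch.optim.SGD(candidate.classifier.parameters(), lr=0.01, momentum=0.9)
    correct, total = 0, 0
    for step, (x, y) in enumerate(loader):
        if step >= steps:
            break
        x, y = x.to(device), y.to(device)
        logits = candidate(x)
        loss = F.cross_entropy(logits, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
        correct += (logits.argmax(1) == y).sum().item()
        total += y.numel()
    return correct / max(total, 1)
```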
General-purpose formal languages for architecture search, such as the language introduced in (Negrinho et al., 2019), further support SM-NAS methodology via modular, composable definitions of search spaces using computational graph abstractions with substitution modules and both independent and dependent hyperparameters. This allows researchers to systematically map structural specifications of architectures into modular, programmable search spaces that are decoupled from the search algorithm, supporting rapid exploration of SM-NAS design spaces across modalities and objectives (Negrinho et al., 2019).
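In the same spirit, a library-agnostic Python sketch of such a search-space language is shown below: architectures are declared as nested substitution (choice) nodes, and any search algorithm only needs a resolver that maps a space to a concrete specification. This mirrors the abstraction rather than reproducing any particular API.

```python
import random

class Choice:
    """A substitution point: exactly one of the listed options is selected."""
    def __init__(self, *options):
        self.options = options

def sample(space):
    """Resolve every Choice node into a concrete value, recursively.

    The search algorithm only needs this interface, so the space definition
    stays decoupled from the searcher (random here; an evolutionary or
    Bayesian searcher would walk the same structure)."""
    if isinstance(space, Choice):
        return sample(random.choice(space.options))
    if isinstance(space, dict):
        return {k: sample(v) for k, v in space.items()}
    if isinstance(space, (list, tuple)):
        return type(space)(sample(v) for v in space)
    return space

# A detector-style search space written as nested, composable choices.
detector_space = {
    "backbone": Choice("ResNet50", {"type": "ResNeXt", "cardinality": Choice(32, 64)}),
    "neck": Choice("no-FPN", {"type": "FPN", "channels": Choice(128, 256)}),
    "head": Choice("2FC-head", {"type": "Cascade", "stages": Choice(2, 3)}),
}
print(sample(detector_space))
```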
7. Discussion, Best Practices, and Limitations
SM-NAS enables:
- Comprehensive exploration of both the combinatorial module structure and internal module variations, supporting fine-grained Pareto optimization for task-specific objectives and hardware targets.
- Efficient search by employing tailored training and evaluation recipes, population-based evolution, and pruning strategies.
- Knowledge reuse via modular, cell-based NAS as exemplified by ModuleNet, which leverages pretrained weights for fast evaluation.
However, several limitations persist:
- The empirical scoring metrics typically used during early-phase rapid evaluation may only correlate, not precisely predict, final performance after full retraining (Chen et al., 2020).
- Fixed module positions or receptive field requirements may limit the compositionality of inherited modules, constraining potential innovation in macro-structure (Chen et al., 2020).
- Extensions beyond classification and object detection, such as segmentation, require careful rethinking of module interfaces and requirements.
- The ultimate effectiveness of SM-NAS is bounded by the expressivity of its search space definitions and the efficiency of the underlying search algorithm, though best practices around modularization, parameterization, and separation of search space from search algorithm are regarded as mitigating factors (Negrinho et al., 2019).
The SM-NAS paradigm has been disseminated in open-source repositories (e.g., the official MMDetection codebase) to facilitate adoption and further experimentation by the academic and applied machine learning communities (Yao et al., 2019).