
Posterior-Guided NAS: Bayesian & Hybrid Methods

Updated 1 September 2025
  • Posterior-Guided NAS is defined as a family of neural architecture search methods leveraging posterior inference (often Bayesian) to jointly optimize architectures and weights.
  • It employs hybrid variational approximations and techniques like Gumbel-Softmax reparameterization to enable efficient, gradient-based search across high-dimensional spaces.
  • The approach integrates surrogate models and predictor-guided evolution to reduce evaluation costs and improve reliability on benchmark tasks such as CIFAR-10 and ImageNet.

Posterior-Guided Neural Architecture Search (PGNAS) refers to a family of neural architecture search methodologies that incorporate posterior inference—often Bayesian but not exclusively so—as the guiding principle for sampling, ranking, and selecting candidate neural networks. Rather than relying solely on performance signals from limited task training or uniform search over the design space, PGNAS harnesses the learned or inferred posterior distribution (typically over architectures and/or model weights) to steer the search toward high-performing, efficient, or otherwise optimal models. The term "posterior-guided" distinguishes these approaches from classical evaluation-driven NAS and positions them within a spectrum of probabilistic model selection and surrogate-based optimization.

1. Bayesian Formulation and Hybrid Representations

PGNAS formalizes the NAS problem as posterior estimation over the joint space of architectures ($\alpha$) and weights ($w$) given observed data ($\mathcal{D}_t$). The canonical objective is to select architecture-weight pairs $(\alpha, w)$ sampled from $p(\alpha, w \mid \mathcal{D}_t)$ that minimize validation loss:

$$\alpha^* = \arg\min_{(\alpha, w) \sim p(\alpha, w \mid \mathcal{D}_t)} \mathcal{L}\big(\mathcal{M}(\alpha, w); \mathcal{D}_v\big)$$

Since calculating the true posterior is intractable in high-dimensional spaces, PGNAS typically employs variational approximations $q_\theta(\alpha, w)$ parameterized by $\theta$ and optimizes this distribution by minimizing its divergence from the true posterior, often via the ELBO (Evidence Lower Bound) formulation:

$$\mathcal{L}_{VI}(\theta) = \mathrm{KL}\big(q_\theta(\phi) \,\Vert\, p(\phi)\big) - \sum_i \log p\big(y_i \mid f^\phi(x_i)\big)$$

The "hybrid variable" $\phi$ encodes both architecture choices and weights, e.g. $\phi_{l,k}^s = w_{l,k}^s \cdot \alpha_{l,k}^s$, fusing selection and importance.

This joint representation allows end-to-end differentiation—particularly via variational dropout and Gumbel-Softmax reparameterization—enabling gradient-based optimization of both model structure and parameters. Compared to conventional NAS approaches, which separately train architectures and weights or rely on discrete search, the hybrid mechanism in PGNAS substantially improves flexibility and search efficiency (Zhou et al., 2019).
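To make this concrete, below is a minimal sketch of the hybrid mechanism for a single layer choosing among candidate convolution kernels, assuming PyTorch; the class and variable names are illustrative rather than taken from the cited papers. The Gumbel-Softmax relaxation keeps the architecture variable $\alpha$ differentiable, so the fused variable $\phi = w \cdot \alpha$ can be trained by ordinary backpropagation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridConvChoice(nn.Module):
    """Relaxed selection over K candidate 3x3 conv kernel banks.

    The architecture variable alpha is sampled with Gumbel-Softmax, so the
    fused hybrid variable phi = w * alpha stays differentiable end to end.
    """

    def __init__(self, in_ch, out_ch, num_candidates=4, tau=1.0):
        super().__init__()
        self.tau = tau
        # Candidate weights w: one kernel bank per architectural option.
        self.weights = nn.Parameter(
            torch.randn(num_candidates, out_ch, in_ch, 3, 3) * 0.1)
        # Variational logits over architecture choices, i.e. q_theta(alpha).
        self.logits = nn.Parameter(torch.zeros(num_candidates))

    def forward(self, x):
        # Differentiable (soft) one-hot sample alpha ~ q_theta.
        alpha = F.gumbel_softmax(self.logits, tau=self.tau, hard=False)
        # Hybrid variable phi = sum_k alpha_k * w_k.
        phi = (alpha.view(-1, 1, 1, 1, 1) * self.weights).sum(dim=0)
        return F.conv2d(x, phi, padding=1)

layer = HybridConvChoice(in_ch=8, out_ch=16)
out = layer(torch.randn(2, 8, 32, 32))
print(out.shape)  # torch.Size([2, 16, 32, 32])
```

In a full PGNAS training loop the logits would also enter the KL term of the ELBO above; only the reparameterized forward pass is sketched here.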

2. Posterior-Guided Sampling and Surrogate Models

PGNAS leverages the learned variational posterior to sample architecture-weight pairs adaptively. Sampling is performed by activating or dropping components (e.g., convolutional kernels) with probabilities learned during training, which capture the posterior marginal over architecture features. The procedure avoids manual hyperparameter tuning of dropout rates or sampling temperatures.
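As a toy illustration of this step (assuming the posterior marginals have already been learned; the keep_probs values below are invented), sampling reduces to drawing Bernoulli masks from the posterior rather than from a hand-tuned dropout rate:

```python
import torch

def sample_architecture(keep_probs, num_samples=8):
    """Sample binary architecture masks from learned posterior marginals.

    keep_probs: tensor of shape (num_components,), each entry the learned
    posterior probability that a component (e.g. a kernel) is active.
    Returns a (num_samples, num_components) batch of 0/1 masks.
    """
    return torch.bernoulli(keep_probs.repeat(num_samples, 1))

# Marginals as they might look after variational training (illustrative).
keep_probs = torch.tensor([0.95, 0.10, 0.70, 0.88, 0.05])
masks = sample_architecture(keep_probs)
print(masks)
```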

Surrogate models—often graph-based neural predictors—can be trained to approximate the mapping from architecture encoding to predicted performance metrics, extending posterior guidance to broader evolutionary or Bayesian optimization frameworks. For example,

  • In NPENAS (Wei et al., 2020), a graph isomorphism network outputs the mean $\mu(s)$ and standard deviation $\sigma(s)$ of a Gaussian that models architecture performance: $f(s) \sim \mathcal{N}(\mu(s), \sigma(s))$.
  • Acquisition functions, such as Thompson Sampling, utilize surrogate uncertainty to balance exploration and exploitation.

The search process thus exploits posterior or predictive distributions to select promising candidates, reducing both the number of expensive full evaluations and the risk of bias toward over-represented subspaces.
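A compact sketch of the Thompson-sampling acquisition under these assumptions: a surrogate supplies $\mu(s)$ and $\sigma(s)$ per candidate (hard-coded toy values here; in NPENAS this role is played by the graph isomorphism network), and the candidate with the best sampled score is selected.

```python
import numpy as np

def thompson_select(means, stds, rng):
    """Pick the candidate whose sampled performance is best.

    means, stds: per-candidate surrogate outputs mu(s), sigma(s).
    Drawing f(s) ~ N(mu(s), sigma(s)^2) and taking the argmax trades off
    exploitation (high mu) against exploration (high sigma).
    """
    draws = rng.normal(means, stds)
    return int(np.argmax(draws)), draws

rng = np.random.default_rng(0)
# Toy surrogate outputs for 5 candidate architectures (illustrative values).
means = np.array([0.91, 0.93, 0.90, 0.92, 0.89])   # predicted accuracy
stds  = np.array([0.01, 0.005, 0.04, 0.02, 0.03])  # predictive uncertainty
best, draws = thompson_select(means, stds, rng)
print(f"selected candidate {best}, sampled scores {np.round(draws, 3)}")
```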

| PGNAS Variant | Posterior Type | Surrogate Role |
|---|---|---|
| PGNAS (Zhou et al., 2019) | Joint over $(\alpha, w)$ | Guides dropout-based stochastic sampling |
| NPENAS-BO (Wei et al., 2020) | Surrogate $\mathcal{N}(\mu, \sigma)$ | Acquisition for evolutionary selection |
| GPNAS (Ai et al., 2021) | Bayesian on topologies | GCN predictor refines BOHB sampling |

3. Addressing Weight Sharing, Posterior Fading, and Reliability

One-shot NAS approaches, which train a super-network and evaluate architecture "slices" under shared weights, are subject to "posterior fading": the divergence between a candidate's true parameter posterior (if trained alone) and its proxy posterior (under shared weights). This effect degrades the reliability of architecture ranking, as observed in PC-NAS (Li et al., 2019):

$$\mathrm{KL}\big(p_{\text{alone}}(\theta \mid m, \mathcal{D}) \,\Vert\, p_{\text{share}}(\theta \mid \mathcal{D})\big) \uparrow \text{ as } K \uparrow$$

where $K$ is the number of architectures included in the supergraph.

PGNAS and PC-NAS mitigate this either by progressively shrinking the search space (maintaining pools of partial models) or by coupling sample selection with posterior convergence. Ensuring architecture-weight fidelity during sampling alleviates the mismatch that arises in naive one-shot methods.

Empirical measures, such as the Pearson correlation between accuracy estimated with shared weights and accuracy measured after stand-alone retraining, substantiate these improvements: PC-NAS achieves $\rho = 0.92$ on ImageNet, versus $0.11$ for previous weight-sharing methods.
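The correlation check itself is easy to reproduce; the sketch below computes the Pearson correlation between proxy (shared-weight) accuracies and stand-alone retraining accuracies, with invented numbers standing in for real measurements.

```python
import numpy as np

def ranking_fidelity(proxy_acc, retrained_acc):
    """Pearson correlation between shared-weight and retrained accuracy.

    Values near 1 mean the supernet proxy ranks architectures faithfully;
    values near 0 indicate severe posterior fading.
    """
    return float(np.corrcoef(proxy_acc, retrained_acc)[0, 1])

# Hypothetical accuracies for 6 candidate architectures (illustrative).
proxy     = np.array([0.62, 0.58, 0.65, 0.60, 0.66, 0.59])
retrained = np.array([0.74, 0.70, 0.76, 0.71, 0.77, 0.70])
print(f"rho = {ranking_fidelity(proxy, retrained):.2f}")
```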

4. Predictor-Guided Evolution and Surrogate-Based Exploration

Evolutionary PGNAS variants integrate surrogate predictors to rank and select offspring architectures. In NPENAS (Wei et al., 2020), both uncertainty-aware and deterministic predictors are used:

  • Neural predictors receive architecture graphs, embed them, and predict mean/variance or fitness directly.
  • Random architecture sampling mechanisms improve coverage and reduce sampling bias in search spaces.

In PRE-NAS (Peng et al., 2022), predictor-guided evolution is combined with topological homogeneity and high-fidelity weight inheritance, ensuring that offspring architectures differ minimally from parents. This design increases the reliability of weight transfer, improves the accuracy of predictions, and reduces computational cost.
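One generation of such a loop might look like the following sketch, which assumes single-edit mutations (to preserve topological homogeneity and keep inherited parent weights a good initialization) and a generic surrogate predictor; all names and encodings are illustrative, not PRE-NAS's exact implementation.

```python
import random

def evolve_one_generation(population, mutate, predictor, n_offspring=20, k=5):
    """Generate offspring by mutation, rank them with a surrogate predictor,
    and return the top-k for true (expensive) evaluation.

    population: list of architecture records with known fitness.
    mutate:     produces a near-identical child (small edit, so inherited
                parent weights remain a good initialization).
    predictor:  maps an encoding to a predicted fitness score.
    """
    parents = sorted(population, key=lambda a: a["fitness"], reverse=True)[:k]
    offspring = [mutate(random.choice(parents)["arch"]) for _ in range(n_offspring)]
    # Cheap surrogate ranking replaces full training for most candidates.
    ranked = sorted(offspring, key=predictor, reverse=True)
    return ranked[:k]  # only these are trained and evaluated for real

# Toy encoding: a tuple of operation ids per layer (illustrative).
def mutate(arch):
    child = list(arch)
    i = random.randrange(len(child))
    child[i] = random.randrange(5)  # swap one op: minimal topological change
    return tuple(child)

predictor = lambda arch: -sum(arch)  # stand-in surrogate, not a real model
population = [{"arch": (1, 2, 3, 0), "fitness": 0.90},
              {"arch": (2, 2, 1, 1), "fitness": 0.87}]
print(evolve_one_generation(population, mutate, predictor))
```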

| Method | Predictor Type | Offspring Generation | Weight Transfer |
|---|---|---|---|
| NPENAS-BO (Wei et al., 2020) | Uncertainty-aware GNN | Mutation | Standard |
| PRE-NAS (Peng et al., 2022) | Random forest | Multi-mutation | High-fidelity |

5. Sample Efficiency, Multi-Objective Optimization, and Empirical Validation

PGNAS methods consistently demonstrate high sample efficiency in benchmark evaluations:

  • PGNAS-MI achieves 1.98% error on CIFAR-10 in just 11 GPU days (Zhou et al., 2019).
  • AG-Net, a generative PGNAS variant (Lukasik et al., 2022), discovers architectures with 94.18% accuracy on NAS-Bench-101 using only 192 queries.
  • NPENAS-BO matches oracle baselines on NAS-Bench-201 with 8.93% mean error over 600 trials and is 4.7× faster than BANANAS.

Multi-objective extensions are natural: AG-Net jointly optimizes accuracy and hardware latency, imposing constraints $g_h(G) \le L$ and using weighted retraining to bias the generator toward feasible designs.
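A minimal sketch of the constrained selection step, assuming a per-architecture latency estimate $g_h$ and budget $L$; the normalized-accuracy weighting here is an illustrative stand-in for AG-Net's weighted retraining, not its exact recipe.

```python
def feasible_weighted(candidates, latency_budget):
    """Filter candidates by the hardware constraint g_h(G) <= L and weight
    the survivors by accuracy, so retraining biases the generator toward
    high-accuracy, feasible designs.

    candidates: list of (arch, accuracy, latency_ms) triples.
    """
    feasible = [(a, acc) for a, acc, lat in candidates if lat <= latency_budget]
    total = sum(acc for _, acc in feasible) or 1.0
    return [(a, acc / total) for a, acc in feasible]  # (arch, retrain weight)

cands = [("A", 0.94, 12.0), ("B", 0.95, 30.0), ("C", 0.92, 9.5)]
print(feasible_weighted(cands, latency_budget=15.0))
# B exceeds the budget and is dropped; A and C share the retraining weight.
```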

Architectures discovered via PGNAS have been validated on tasks including object detection (COCO), person re-ID (Market-1501), and large-scale image classification (ImageNet), often outperforming manually designed, RL-based, or proxyless architectures in both accuracy and efficiency.

6. Implementation, Stability Analysis, and Practical Impact

PGNAS frameworks are often implemented with hybrid search spaces (e.g., decoupled skip-connection and operator choices (Ai et al., 2021)), GNN-based predictors, and differentiable architecture encoding. Practical tools such as BOHB, variational dropout, and Gumbel-Softmax relaxation are standard.

Stability analysis, exemplified by GPNAS (Ai et al., 2021), evaluates the robustness of network structures to operator variations, providing additional reliability for downstream deployment.

Key practical benefits and implications:

  • Substantial reduction in computational cost (e.g., 0.6 GPU days for competitive DARTS architectures in PRE-NAS).
  • Robustness against search space bias and proxy estimator noise through careful sampling and prediction mechanisms.
  • Applicability to resource-constrained domains (e.g., mobile inference) and rapid prototyping.

7. Relation to Biological and Physics-Guided Frameworks

The "posterior" in PGNAS need not be limited to Bayesian posteriors on parametric models. Teacher-guided search (Bashivan et al., 2018) employs representational similarity analysis (RSA) to compare candidate network activations—i.e., internal posteriors—to those of biological teachers (primate ventral stream recordings). This matching yields efficiency improvements and enables direct guidance from latent neural codes.

NAS-PINN (Wang et al., 2023) applies continuous relaxation and bi-level optimization to discover architectures suitable for physics-informed neural networks. Although not strictly posterior-guided, the optimization parallels PGNAS by integrating architecture search into the solution of PDE-constrained problems, informing future extensions where physics-based priors serve as posterior guides.


Posterior-Guided Neural Architecture Search constitutes a principled paradigm for automating model design. By exploiting posterior inference, whether probabilistic, surrogate-driven, or biologically inspired, PGNAS improves search efficiency, reliability, and generalization. Its technical repertoire includes Bayesian variational methods, predictor-guided evolution, multi-objective sampling, and explicit mechanisms for handling search-space bias, sampling bias, and parameter uncertainty. The empirical evidence from benchmark tasks supports its status as an efficient and reliable approach for both standard and domain-adapted neural architecture optimization.