
Per-Module Hyperparameter Optimisation

Updated 30 December 2025
  • Per-module hyperparameter optimisation is a method that tunes hyperparameters individually for each network module, enhancing stability and convergence in diverse architectures.
  • It employs systematic scaling laws and search strategies like trust-region random search and Bayesian methods to navigate high-dimensional tuning spaces.
  • Empirical results show significant speedups and performance gains in models such as transformers and large language models when using this approach.

Per-module hyperparameter optimisation is a methodology in modern deep learning and optimisation wherein hyperparameters are tuned individually for distinct architectural modules or algorithmic components, rather than being shared globally across the whole model or algorithm. This paradigm accommodates the heterogeneous behavior and scaling properties of diverse network or algorithm segments—such as transformer blocks, subnetworks, optimizer modules, or augmentation blocks—yielding improved performance and efficiency. Recent advances validate systematic approaches for both searching and transferring per-module hyperparameters, especially under rapid scaling in model capacity, data, and compute budgets (Mlodozeniec et al., 26 Dec 2025, Mlodozeniec et al., 2023, Treacher et al., 2022, Nobel et al., 2021).

1. Motivation for Per-Module Hyperparameter Optimisation

Traditional global hyperparameter regimes enforce a single learning-rate, weight-decay, or optimizer setting across all parameter groups of a network. In modular architectures (transformers, CNNs, attention blocks, federated and multi-subnetwork models), modules differ substantially in gradient scale, curvature, effective capacity, and data exposure. Empirical findings show that enforcing uniform hyperparameter values can lead to sub-optimal outcomes by orders of magnitude, with distinct modules favoring sharply different configurations for stability and convergence rate (Mlodozeniec et al., 26 Dec 2025).

A plausible implication is that as models are scaled in width, depth, batch size, or training duration, optimal per-module settings increasingly diverge from a one-size-fits-all rule. Empirical evidence further demonstrates that per-module tuning at proxy (small) scale can yield substantial training speedups and generalization improvements when transferred to much larger models (Mlodozeniec et al., 26 Dec 2025, Treacher et al., 2022).

2. Parameterisations and Scaling Laws

State-of-the-art per-module hyperparameter transfer leverages the unified Complete$^{(d)}$ parameterisation, which co-adapts module settings along four axes: width ($N$), depth ($L$), batch size ($B$), and duration ($D$) (Mlodozeniec et al., 26 Dec 2025). This parameterisation prescribes that any per-module hyperparameter $\zeta_{m,\ell}$ at a new scale is computed as:

$$\zeta_{m,\ell}(N, L, B, D) = \zeta^{\text{type}}_m \cdot \zeta^{\text{depth}}_\ell(L) \cdot \mathrm{SDE}(B, D) \cdot \mathrm{CP}_m(N, L)$$

where:

  • $\zeta^{\text{type}}_m$ is the module-type multiplier tuned at base scale,
  • $\zeta^{\text{depth}}_\ell(L)$ is a linearly-interpolated depth multiplier,
  • $\mathrm{SDE}(B, D)$ implements the square-root batch-size/duration reparameterisation,
  • $\mathrm{CP}_m(N, L)$ applies explicit scaling laws (such as for learning rates, initialization scales, residual multipliers) per module type (see Table 1 in (Mlodozeniec et al., 26 Dec 2025)).

Each module (embeddings, QKV projections, MLP blocks, LayerNorms) is assigned its own learning-rate multiplier ($\eta_{m,\ell}$), weight decay ($\lambda_{m,\ell}$), AdamW parameters, initialization scale ($\sigma_{m,\ell}$), and, for residual blocks, a residual scaling factor ($s_{m,\ell}$).
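A minimal sketch of this composition in Python (the factor functions and their arguments here are illustrative assumptions; the concrete depth-interpolation rule, SDE reparameterisation, and per-module scaling-law table are those of the cited paper and are not reproduced exactly):

```python
import math

def depth_multiplier(layer_idx, n_layers, anchor_mults):
    """Linearly interpolate depth-wise multipliers tuned at a shallower base
    depth onto n_layers layers (illustrative interpolation scheme)."""
    pos = layer_idx / max(n_layers - 1, 1)          # position in [0, 1]
    x = pos * (len(anchor_mults) - 1)
    lo, hi = math.floor(x), math.ceil(x)
    frac = x - lo
    return (1 - frac) * anchor_mults[lo] + frac * anchor_mults[hi]

def sde_factor(B, D, B0, D0):
    """Square-root batch-size/duration reparameterisation, written here
    relative to the base scale (B0, D0), which is an assumption of this sketch."""
    return math.sqrt((B / B0) / (D / D0))

def per_module_hparam(base_value, type_mult, layer_idx, n_layers,
                      anchor_mults, B, D, B0, D0, cp_scaling):
    """Compose zeta_{m,l}(N, L, B, D) = type * depth(L) * SDE(B, D) * CP_m(N, L)."""
    return (base_value * type_mult
            * depth_multiplier(layer_idx, n_layers, anchor_mults)
            * sde_factor(B, D, B0, D0)
            * cp_scaling)  # CP_m(N, L): module-specific width/depth scaling law
```

Here `base_value` and `type_mult` are the multipliers found at the proxy scale, and `cp_scaling` stands in for the module-specific width/depth law tabulated in the cited paper.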

Transfer to larger models combines the base-scale optimised multipliers with explicit formulaic rescalings (e.g., $\eta_{\text{hidden}} \gets \eta_b \cdot N \cdot L \cdot \sqrt{B/D}$) for each module and depth group, without requiring re-tuning at scale (Mlodozeniec et al., 26 Dec 2025).
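For instance, a literal transcription of the quoted hidden-layer rescaling is sketched below (whether $N$, $L$, $B$, $D$ enter as absolute sizes or as ratios to the base scale is fixed by the parameterisation in the cited paper; this sketch simply mirrors the formula as written):

```python
import math

def transfer_hidden_lr(eta_b, N, L, B, D):
    """eta_hidden <- eta_b * N * L * sqrt(B / D), applied at the target scale
    without any re-tuning (per the rescaling quoted above)."""
    return eta_b * N * L * math.sqrt(B / D)
```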

3. Search Algorithms and High-Dimensionality

The search space for per-module hyperparameter optimisation often reaches 80+ dimensions due to the combination of module types and depths. The response surfaces exhibit “invex” geometry (i.e., the absence of spurious local minima) but contain sharp cliffs where runs diverge (Mlodozeniec et al., 26 Dec 2025). Gaussian Process surrogates become ineffective due to nonstationarity and high failure rates in random search.

A robust approach applies trust-region random search in log-space. Starting from global-tuned hyperparameters, the algorithm samples perturbations within a fixed radius, repeatedly shrinking the radius upon consecutive failures. Each candidate configuration is trained for a full run; up to 5,000 parallel trials are feasible on modern hardware (Mlodozeniec et al., 26 Dec 2025).
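A minimal sketch of this loop is given below, assuming a user-supplied `train_and_eval` callback that launches a full training run for the per-module hyperparameters $\exp(\theta)$ and returns the final loss (NaN on divergence); the radius, budget, and patience constants are illustrative rather than taken from the paper:

```python
import numpy as np

def trust_region_random_search(train_and_eval, theta0_log, n_rounds=20,
                               samples_per_round=64, radius=0.5,
                               shrink=0.5, patience=3, seed=0):
    """Trust-region random search over log-hyperparameters."""
    rng = np.random.default_rng(seed)
    best_theta = np.asarray(theta0_log, dtype=float)   # start from global-tuned values
    best_loss = train_and_eval(best_theta)
    if not np.isfinite(best_loss):
        best_loss = np.inf
    failures = 0
    for _ in range(n_rounds):
        # Sample candidates uniformly within the current trust region (log-space).
        candidates = best_theta + rng.uniform(
            -radius, radius, size=(samples_per_round, best_theta.size))
        losses = np.array([train_and_eval(c) for c in candidates])  # parallelisable
        ok = np.isfinite(losses)
        if ok.any() and losses[ok].min() < best_loss:
            best_idx = np.flatnonzero(ok)[np.argmin(losses[ok])]
            best_loss, best_theta = losses[best_idx], candidates[best_idx]
            failures = 0
        else:
            failures += 1
            if failures >= patience:   # repeated failures: shrink the trust region
                radius *= shrink
                failures = 0
    return best_theta, best_loss
```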

For multi-subnetwork architectures, additional strategies include (both are sketched after this list):

  • Divide-and-Conquer Bayesian Optimisation (DCBO): recombine well-tuned subnetworks from previous top-performing full models, leveraging an exponentially large candidate pool without incurring full re-training costs (Treacher et al., 2022).
  • Subnetwork-Adaptive Bayesian Optimisation (SABO): allocate search budget to subnetworks according to their empirically estimated influence on end-to-end loss, favouring exploration of weak links (Treacher et al., 2022).
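A rough sketch of both ideas, assuming subnetwork configurations are represented as plain dictionaries and that influence scores on the end-to-end loss are estimated separately:

```python
from itertools import product

def dcbo_candidate_pool(top_models):
    """DCBO-style recombination sketch: take the per-subnetwork settings of the
    top-performing full models and form every cross-combination, giving a
    candidate pool that grows exponentially with the number of subnetworks
    without any combination having been trained end-to-end yet."""
    subnets = list(top_models[0].keys())
    options = {s: [m[s] for m in top_models] for s in subnets}
    return [dict(zip(subnets, combo))
            for combo in product(*(options[s] for s in subnets))]

def sabo_budget(influence, total_trials):
    """SABO-style allocation sketch: split the tuning budget across subnetworks
    in proportion to their estimated influence on the end-to-end loss."""
    total = sum(influence.values())
    return {s: max(1, round(total_trials * w / total)) for s, w in influence.items()}
```

For example, `sabo_budget({"cnn": 0.7, "dfnn": 0.3}, total_trials=100)` would direct roughly 70 trials to the CNN subnetwork and 30 to the DFNN subnetwork (the names and scores are hypothetical).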

4. Transfer and Validation Protocols

Per-module hyperparameters optimised at small scale transfer to large-scale models via explicit scaling laws—global multipliers remain unchanged, depth-wise multipliers are interpolated, and formulaic rescalings ensure invariants when adjusting width, depth, batch, and duration (Mlodozeniec et al., 26 Dec 2025).

In applied settings, the network and data are partitioned (e.g., as in (Mlodozeniec et al., 2023)), with each block updated on its own data shard and an out-of-training-sample (OOTS) criterion applied for hyperparameter validation. This marginal-likelihood-inspired criterion efficiently optimises hyperparameters (including data-augmentation, regularization, and dropout rates) in a single interleaved run, sidestepping the need for separate validation sets or repeated runs.
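A loose, illustrative sketch of such an interleaved scheme is shown below; this is one possible reading of the partitioned set-up rather than the authors' exact algorithm, and `loss_fn` / `grad_fn` are assumed user-supplied callables:

```python
import numpy as np

def interleaved_oots_epoch(blocks, shards, loss_fn, grad_fn, hypers, lr=1e-2):
    """One pass of partitioned training with an out-of-training-sample signal.
    Parameter block k is updated only on data shard k; the OOTS criterion used
    to select the hyperparameters `hypers` is accumulated from the shards that
    block k was not trained on, so no separate validation set is needed."""
    K = len(blocks)
    oots = 0.0
    for k in range(K):
        # Weight update for block k on its own shard.
        blocks[k] = blocks[k] - lr * grad_fn(blocks, k, shards[k], hypers)
        # OOTS contribution: evaluate on the shards block k has not seen.
        oots += float(np.mean([loss_fn(blocks, shards[j], hypers)
                               for j in range(K) if j != k]))
    return oots / K   # criterion to minimise over `hypers` (augmentation, dropout, ...)
```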

For algorithmic frameworks such as Modular CMA-ES, per-module tuning comprehensively assesses contributions of new modules by comparing baseline vs extended search spaces, analyzing activation frequencies and interplay among modules, and evaluating performance via metrics like Area Over the ECDF (AOC) and Expected Running Time (ERT) (Nobel et al., 2021).
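As a concrete example of such an aggregate metric, a simplified Area-Over-the-ECDF computation might look as follows (normalisation and budget-spacing conventions vary across tools, so this is not necessarily the exact definition used in the cited work):

```python
import numpy as np

def area_over_ecdf(hitting_times, max_budget, n_points=101):
    """hitting_times[r, t] is the number of evaluations run r needed to reach
    target t (np.inf if never reached). Returns one minus the mean ECDF value
    over log-spaced budgets, i.e. an area-over-the-curve score (lower is better)."""
    times = np.asarray(hitting_times, dtype=float)
    budgets = np.logspace(0, np.log10(max_budget), n_points)
    # ECDF at budget b: fraction of (run, target) pairs solved within b evaluations.
    ecdf = np.array([(times <= b).mean() for b in budgets])
    return 1.0 - ecdf.mean()
```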

5. Empirical Performance Gains and Benchmarks

Experimental results validate the effectiveness of per-module hyperparameter optimisation:

  • Transformer models: per-module tuned hyperparameters reach the target loss $2.3\times$ faster than the best global hyperparameters, with large-scale transfer yielding $27$% speed-ups (i.e., $1.32\times$ faster) at identical final loss (Mlodozeniec et al., 26 Dec 2025).
  • LLMs (7.2B params): pre-training accelerated by $\approx 25$%, improving downstream benchmarks for language, coding, and reasoning (Mlodozeniec et al., 26 Dec 2025).
  • Multi-subnetwork deep models (CNN/DFNN): DCBO achieves mean speedups of $4.8\times$ to $23.6\times$, accuracy gains up to $3.5$%, and MSE reductions up to $4.4$ vs single-model TPE-based Bayesian optimisation (Treacher et al., 2022).
  • Neural network partitioning: OOTS-driven validation lowers error rates by $3$–$5$ pp in low-data regimes and achieves competitive performance on MNIST, CIFAR-10, TinyImageNet, and federated learning benchmarks (Mlodozeniec et al., 2023).
  • Modular optimisers: extending module options in Modular CMA-ES yields up to $97$% improvement in AOC on specific BBOB functions, with clear dependencies among module settings (Nobel et al., 2021).

6. Methodological Guidelines and Analysis

Canonical guidelines for per-module hyperparameter optimisation include:

  • Decompose architectures into orthogonal modules; encode both categorical (activation/switches) and continuous (scaling) hyperparameters per module (Nobel et al., 2021).
  • Construct manageable search spaces, leveraging base-scale tuning for global and type-wise multipliers, and continuous interpolation for depth multipliers (Mlodozeniec et al., 26 Dec 2025).
  • For racing-based tuners (e.g., irace), allocate budget per module group, and replicate assessment runs for robust validation (Nobel et al., 2021).
  • For multi-subnetwork architectures, prefer DCBO for maximum speed-up, optionally augmenting with SABO when subnetwork relevance is highly imbalanced (Treacher et al., 2022).
  • Always conduct independent verification of elite configurations, and analyze module activation frequencies and interplay via statistical tests and visualizations (Nobel et al., 2021).
  • Use robust aggregate metrics, such as Area Over the ECDF and quantile bands for ERT, to summarise the distribution of performance across targets (Nobel et al., 2021).

7. Limitations and Future Directions

While current per-module hyperparameter optimisation strategies demonstrate substantial empirical gains, several constraints and open questions remain:

  • DCBO/SABO approaches have primarily been validated with TPE-based Bayesian Optimisation; generalization to GP-based BO, population-based training, or RL-driven NAS requires further study (Treacher et al., 2022).
  • Effectiveness is established for CNN/DFNN subnetworks; extensions to attention mechanisms, RNNs, graph networks, or arbitrary architectures entail additional complexity in loss estimation (Treacher et al., 2022).
  • Partitioning schemes require careful choice of the number of partitions $K$ to balance granularity of OOTS signals and sufficient representation per block (Mlodozeniec et al., 2023).
  • Empirical speedups are sensitive to implementation details (e.g., weight freezing, framework specifics); transfer may require adaptation in non-TensorFlow environments (Treacher et al., 2022).
  • For module assessment in optimisers, high run-to-run variance and stochasticity motivate multiple repeated tuner runs and conservative configuration budgets (Nobel et al., 2021).

A plausible implication is that the future of per-module hyperparameter optimisation may involve automated proxy-driven importance rankings, hierarchical priors, and multi-fidelity or early-stopping budget control to expedite convergence and scalability in increasingly modular deep learning and optimisation landscapes.
