Zero-shot Hyperparameter Transfer
- Zero-shot hyperparameter transfer is a family of methods that configures hyperparameters for unseen tasks using meta-data and scale-aware techniques.
- It leverages approaches such as μTransfer, meta-learned surrogates, and ensemble strategies to enable robust, cost-effective hyperparameter selection.
- Empirical benchmarks indicate that these methods reduce tuning costs and improve transfer speed while maintaining competitive accuracy across diverse domains.
Zero-shot hyperparameter transfer refers to a family of methodologies that enable the selection or configuration of hyperparameters for new tasks, datasets, model scales, or domains, without explicit search or tuning on the target instance. Instead, transfer occurs via principled statistical, meta-learning, or scale-aware mechanisms that leverage prior information—whether learned surrogates, scale invariance in the optimization landscape, or model-averaged ensembles—to deliver high-performing configurations in the absence of direct target supervision or target validation feedback.
1. Foundational Problem and Formal Definitions
The canonical hyperparameter optimization (HPO) objective is, for a given algorithm $A$, dataset $D$, and hyperparameter space $\Lambda$, to find $\lambda^* \in \arg\min_{\lambda \in \Lambda} \mathcal{L}(A, D, \lambda)$, where $\mathcal{L}(A, D, \lambda)$ is the expected risk or held-out loss induced by $A$ trained on $D$ with configuration $\lambda$. Zero-shot hyperparameter transfer generalizes this by seeking, for a collection of prior datasets (or tasks, model scales) $D_1, \dots, D_n$, either:
- A static configuration $\lambda^{\mathrm{ZS}} \in \Lambda$, or more generally
- A mapping $g$ from dataset/scale descriptors $m(D)$ to configurations $\lambda = g(m(D))$,
that minimizes the average or expected risk across new, unseen datasets $D_{\mathrm{new}}$: $\min_{g} \, \mathbb{E}_{D_{\mathrm{new}}}\big[\mathcal{L}\big(A, D_{\mathrm{new}}, g(m(D_{\mathrm{new}}))\big)\big]$. Zero-shot protocols disallow any target-label or target-validation feedback; all information flow occurs through either prior hyperparameter-performance tuples, structured meta-data, or parameterization-induced invariance (e.g., scale-aware transfer) (Gijsbers et al., 2021, Winkelmolen et al., 2020, Yang et al., 2022, Schmidt et al., 2023, Ghosh et al., 28 Dec 2025, Meindl et al., 3 Oct 2025).
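The static-configuration variant of this objective can be sketched concretely. The snippet below is a minimal illustration, assuming a hypothetical precomputed meta-table of held-out losses over prior tasks and candidate configurations; the zero-shot default is simply the column minimizing average loss, with no target feedback ever consulted.

```python
import numpy as np

# Hypothetical meta-table: rows = prior datasets D_1..D_n, columns = candidate
# configurations; entries stand in for held-out losses L(A, D_i, lambda_j).
rng = np.random.default_rng(0)
meta_losses = rng.uniform(0.1, 1.0, size=(20, 8))

# Static zero-shot default: the single configuration minimizing average risk
# across all prior tasks (no target-validation feedback is used).
lambda_zs = int(np.argmin(meta_losses.mean(axis=0)))

# Estimated regret vs. the per-task oracle, measured on the prior pool.
regret = float((meta_losses[:, lambda_zs] - meta_losses.min(axis=1)).mean())
print(lambda_zs, round(regret, 3))
```

The mapping variant $g(m(D))$ generalizes this by replacing the fixed column index with a learned function of dataset meta-features.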
2. Approaches and Algorithmic Schemes
Zero-shot hyperparameter transfer comprises several principal categories, each supported by rigorous empirical and theoretical analysis:
2.1 Scale-Aware (Parametrization-Based) Transfer
- Maximal Update Parametrization (μP): μP reparameterizes neural networks such that both activations and per-step parameter updates remain $\Theta(1)$ as width $n \to \infty$. Under μP, optimal hyperparameters (learning rates, optimizer coefficients, initialization scales) empirically and theoretically converge to scale-stable optima, enabling zero-shot transfer from a proxy (small) model to large-scale targets. This holds for Transformers, MLPs, ResNets, and extends to operator regimes (Fourier Neural Operators) with modified scaling laws (Yang et al., 2022, Li et al., 24 Jun 2025, Ghosh et al., 28 Dec 2025).
- Algorithmic Procedure:
- Parametrize the large target model in μP.
- Tune all μP-transferable hyperparameters on a smaller proxy model (recommended width ≥ 256; reduced batch size, sequence length, etc.).
- Directly apply the same hyperparameters to the large-scale model; no further tuning.
- Scaling Law for FNOs: For Fourier Neural Operators, the μTransfer-FNO scheme rescales initialization variance and learning rate as a function of the number of Fourier modes $k$ and the input dimension $d$ (Li et al., 24 Jun 2025).
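The proxy-to-target step above can be sketched as a deterministic rescaling of the tuned configuration. The rules below are a simplified rendering of μP-style scaling for Adam (hidden-matrix learning rate shrinks as 1/width-multiplier, hidden init std as 1/√fan-in), not the full parametrization; the config keys are illustrative.

```python
# Simplified muP-style hyperparameter scaling for Adam. The full
# parametrization also adjusts output multipliers and attention scaling.
def mup_scale(base_cfg, base_width, target_width):
    """Map HPs tuned at a proxy width to a target width under muP-style rules."""
    m = target_width / base_width  # width multiplier
    return {
        # Hidden ("matrix-like") weights: Adam LR shrinks as 1/m under muP.
        "hidden_lr": base_cfg["hidden_lr"] / m,
        # Hidden init std follows 1/sqrt(fan_in), so it shrinks as 1/sqrt(m).
        "hidden_init_std": base_cfg["hidden_init_std"] / m ** 0.5,
        # Input-layer and bias LRs are width-stable and transfer unchanged.
        "input_lr": base_cfg["input_lr"],
    }

# Configuration tuned once on a width-256 proxy, applied at width 4096.
base = {"hidden_lr": 1e-3, "hidden_init_std": 0.02, "input_lr": 1e-3}
scaled = mup_scale(base, base_width=256, target_width=4096)
print(scaled)
```

No further tuning happens at the target width; the rescaled configuration is applied directly.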
2.2 Surrogate, Meta-Learned, and Ensemble Methods
- Meta-Learned Symbolic Defaults: Symbolic or parametric mappings from dataset meta-features (e.g., size, feature type counts, class imbalance) to HP configurations, learned over pools of prior problems using surrogate losses and grammar-based symbolic regression (Gijsbers et al., 2021, Winkelmolen et al., 2020).
- Zero-shot AutoML and Meta-surrogates: Deep surrogates (typically MLPs) predict the rank or expected performance of any HP configuration given only simple meta-features of the target dataset (e.g., image resolution, class count), jointly trained on extensive meta-datasets of pipeline/dataset evaluations under a pairwise ranking loss (Öztürk et al., 2022).
- Ensemble Surrogates for BO: Bayesian optimization methods aggregate posterior predictions from prior-task GPs (Gaussian Processes) via ranking-weighted, bootstrap-based mixtures, providing hyperparameter recommendations or acquisition functions that adaptively leverage past HP landscapes in zero-shot or few-shot regimes (Feurer et al., 2018).
- Zero-shot Combinatorial Set Construction: Greedy submodular optimization on large meta-tables of loss evaluations across many datasets and configurations, yielding small lists of zero-shot default HPs with the guarantee that one performs near-optimally on any new task (Winkelmolen et al., 2020).
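The greedy set construction in the last item can be sketched as follows; this is a minimal illustration on a synthetic meta-table, not the cited implementation. Each step adds the configuration that most reduces the average best-so-far loss across prior tasks, yielding a small zero-shot default list.

```python
import numpy as np

def greedy_zero_shot_list(meta_losses, list_size):
    """Greedily build a small list of default configs from a (tasks x configs)
    meta-table of losses, so the min-over-list loss is near-optimal per task."""
    n_tasks, n_cfgs = meta_losses.shape
    chosen, best_so_far = [], np.full(n_tasks, np.inf)
    for _ in range(list_size):
        # Adding config j changes each task's best loss to min(best, L[:, j]);
        # pick the j minimizing the resulting average across tasks.
        scores = [np.minimum(best_so_far, meta_losses[:, j]).mean()
                  for j in range(n_cfgs)]
        j_star = int(np.argmin(scores))
        chosen.append(j_star)
        best_so_far = np.minimum(best_so_far, meta_losses[:, j_star])
    return chosen, float(best_so_far.mean())

rng = np.random.default_rng(1)
table = rng.uniform(size=(30, 50))  # synthetic loss evaluations
configs, avg_loss = greedy_zero_shot_list(table, list_size=5)
print(configs, round(avg_loss, 3))
```

Submodularity of the min-over-list objective is what gives the greedy choice its near-optimality guarantee.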
2.3 Model-Averaging and Ensembling for Robust Transfer
- Accumulative Model Averaging: For multilingual or cross-domain transfer, run-by-run model averaging across distinct hyperparameter realizations (rather than selection by source-dev or target-dev validation) yields superior zero-shot cross-lingual transfer (ZS-XLT), outperforming both single-run and “model soup” baselines even without target labels (Schmidt et al., 2023).
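The averaging step can be sketched as a uniform mean over parameter dictionaries from runs with different hyperparameter realizations; the tensor names and toy values below are illustrative, not from the cited work.

```python
import numpy as np

def average_checkpoints(state_dicts):
    """Uniformly average parameter dicts from runs with different HP settings,
    instead of selecting one run by source-dev or target-dev validation."""
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0)
            for k in state_dicts[0]}

# Three hypothetical runs (e.g., different learning rates / seeds).
runs = [{"encoder.w": np.full((2, 2), float(i)),
         "head.b": np.array([i, -i], dtype=float)}
        for i in range(3)]

avg = average_checkpoints(runs)
print(avg["encoder.w"][0, 0])  # elementwise mean of 0.0, 1.0, 2.0
```

The point of the recipe is that no run is ever singled out by validation; the average itself is the zero-shot model.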
3. Theoretical Guarantees and Transfer Conditions
- Fast Transfer and Scale Invariance: The formal notion of fast HP transfer states that the suboptimality penalty of transferred HPs is asymptotically negligible: writing $\lambda^*_n$ for the optimum tuned at proxy scale $n$ and $\lambda^*_N$ for the optimum at target scale $N$, fast transfer requires $\mathcal{L}_N(\lambda^*_n) - \mathcal{L}_N(\lambda^*_N) = o(\Delta_N)$, where $\Delta_N$ is the inherent finite-model performance gap. This property holds under mild convexity when the optimum HP itself converges to its infinite-scale counterpart $\lambda^*_\infty$ at a polynomial rate in scale (Ghosh et al., 28 Dec 2025). Fast transfer yields compute-optimal tuning: transferred HP search outperforms direct search at the large scale when this criterion holds.
- Synthetic Positive and Negative Examples: Positive cases (ridge regression, μP-MLE) show both empirical and theoretical verification of fast transfer, while counterexamples (e.g., two-layer ReLU “ball indicator” with high nonlinearity) break this condition (Ghosh et al., 28 Dec 2025).
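The ridge-regression positive case can be illustrated with a toy experiment: grid-search the penalty on a small problem, then apply the chosen value at a larger scale and measure the penalty against direct tuning. The setup below is illustrative only (synthetic data, closed-form ridge), not the cited paper's construction.

```python
import numpy as np

def ridge_heldout_loss(Xtr, ytr, Xte, yte, alpha):
    """Closed-form ridge fit on the train split; return held-out MSE."""
    d = Xtr.shape[1]
    w = np.linalg.solve(Xtr.T @ Xtr + alpha * np.eye(d), Xtr.T @ ytr)
    return float(np.mean((Xte @ w - yte) ** 2))

def tune(n, d, alphas, rng):
    """Grid-search the ridge penalty at scale n; return (best index, losses)."""
    X = rng.normal(size=(2 * n, d))
    y = X @ np.ones(d) + 0.5 * rng.normal(size=2 * n)
    losses = [ridge_heldout_loss(X[:n], y[:n], X[n:], y[n:], a) for a in alphas]
    return int(np.argmin(losses)), losses

rng = np.random.default_rng(0)
alphas = np.logspace(-2, 3, 12)
proxy_idx, _ = tune(100, 10, alphas, rng)        # tune at small (proxy) scale
_, target_losses = tune(5000, 10, alphas, rng)   # evaluate grid at target scale
penalty = target_losses[proxy_idx] - min(target_losses)
print(alphas[proxy_idx], round(penalty, 4))
```

Because the optimal ridge penalty is nearly scale-stable in this setting, the transferred value incurs only a small suboptimality penalty, mirroring the fast-transfer condition.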
4. Empirical Benchmarks and Comparative Results
Zero-shot transfer has been validated across diverse modalities, scales, and experimental setups:
| Domain | Methodology | Target Metric | Zero-Shot Transfer Performance | Reference |
|---|---|---|---|---|
| NLP (BERT, GPT-3, XLM-R) | μTransfer (μP) | GLUE, BLEU, F1 | Outperforms direct large-scale tuning | (Yang et al., 2022) |
| PDEs (FNO, PINO) | μTransfer-FNO | L² Error | ≤ Direct, 70% tuning cost reduction | (Li et al., 24 Jun 2025) |
| Cross-lingual NLU | Accum. model averaging | Accuracy/F1 | +0.2–5.0pp over source-dev selection | (Schmidt et al., 2023) |
| Deep Learning (vision) | Zero-shot AutoML/ZAP | ALC (area under learning curve) | ALC +0.09 over best challenge baseline | (Öztürk et al., 2022) |
| Meta-ML (tabular) | Symbolic defaults | Log-loss, rank | Matches 8–16× random search | (Gijsbers et al., 2021) |
| Black-box Opt. (HPO-Bench) | ZeroShotOpt (offline RL) | Norm. regret | Competitive with GP-based BO | (Meindl et al., 3 Oct 2025) |
| BO, various UCI/Kaggle tasks | Ensemble surrogate (RGPE) | Regret | 2–5× speedup over vanilla BO | (Feurer et al., 2018) |
Zero-shot methods outperform baselines that select static “best” configurations, random selection, or challenge-winning AutoML systems. For large-scale NNs and PDEs, scale-aware zero-shot transfer eliminates the need for wide HP sweeps, and for black-box HPO, trained RL policies generalize to unseen optimization landscapes without additional tuning.
5. Limitations and Caveats
- Domain and Task Specificity: Surrogate- and meta-feature-based zero-shot transfer is limited by the coverage and representativeness of the meta-dataset; performance may degrade on substantial domain shifts (Öztürk et al., 2022, Gijsbers et al., 2021).
- Regularizer and Conditional HPs: Some hyperparameters, such as dropout, weight decay, or domain-specialized regularizers, are not reliably transferable and require re-tuning post-transfer even in μP-based protocols (Yang et al., 2022, Li et al., 24 Jun 2025).
- Model Dynamics and Decomposition: Fast transfer is empirically contingent on optimization dynamics concentrating loss reduction in a width-stable, low-rank subspace. For complex tasks or pathological loss surfaces, transferred optima can drift, and transfer does not outperform direct tuning (Ghosh et al., 28 Dec 2025).
6. Best Practices, Practical Recipes, and Community Artifacts
- μP/Scale-Aware Transfer: Parametrize models in μP and tune transferable HPs only on proxies with the minimum recommended width and batch size; empirically verify transfer by loss-curve invariance and top-$k$ subspace inspection (Yang et al., 2022, Ghosh et al., 28 Dec 2025).
- Meta-learned Surrogates and Symbolic Defaults: Deploy surrogates mapping simple, cheap-to-compute meta-features (4–8 dimensions) to HP configurations, learned via ranking loss or evolutionary search on large prior meta-datasets (Gijsbers et al., 2021, Öztürk et al., 2022).
- Zero-shot HPO Lists: Practitioners can utilize precomputed lists of default configs (lookup tables published with method results), executing all in parallel and selecting the best by target validation (Winkelmolen et al., 2020).
- Model Averaging for Cross-Lingual/NLP: Average checkpoints from a diverse HP grid, yielding robust transfer in zero-shot cross-lingual scenarios (Schmidt et al., 2023).
- RL-Based Optimizers: Employ offline-pretrained transformer-based optimizers such as ZeroShotOpt for continuous black-box HPO without acquisition-function configuration or kernel selection (Meindl et al., 3 Oct 2025).
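The zero-shot HPO-list recipe above can be sketched as follows: run one training job per precomputed default (the jobs are independent, hence trivially parallelizable) and keep the configuration with the best target-validation score. The default list and the training function here are placeholders, not published values.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical precomputed zero-shot default list (stand-in for a
# published lookup table of configurations).
DEFAULTS = [
    {"lr": 0.1, "depth": 4},
    {"lr": 0.01, "depth": 8},
    {"lr": 0.003, "depth": 6},
]

def train_and_validate(cfg):
    """Placeholder for training with cfg and scoring on target validation data."""
    loss = abs(cfg["lr"] - 0.01) + 0.01 * cfg["depth"]  # synthetic loss surface
    return loss, cfg

def pick_default(configs):
    # Launch every default independently, then keep the configuration with
    # the best target-validation score.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(train_and_validate, configs))
    return min(results, key=lambda r: r[0])[1]

print(pick_default(DEFAULTS))
```

Note that this recipe spends one target-validation pass per list entry; the list itself, however, is fixed ahead of time with zero target feedback.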
7. Outlook and Research Directions
Zero-shot hyperparameter transfer continues to expand in several directions: parameter-efficient transfer across domain-shifted or mixed-modality tasks, automated scaling laws for emerging model families, extension to mixed and categorical HP spaces (via hybrid models or discrete embedding approaches), and algorithmic pipelines that blend zero-shot and few-shot regimes via online adaptation of meta-surrogates. Cross-modal and multi-objective transfer remains a challenging and active area. The rapid growth of published community artifact tables, symbolic default grammars, and pretrained optimization policies establishes a foundation for both theory-driven and empirical advances in this domain.