
Zero-shot Hyperparameter Transfer

Updated 17 March 2026
  • Zero-shot hyperparameter transfer is a family of methods that configures hyperparameters for unseen tasks using meta-data and scale-aware techniques.
  • It leverages approaches such as μTransfer, meta-learned surrogates, and ensemble strategies to enable robust, cost-effective hyperparameter selection.
  • Empirical benchmarks indicate that these methods reduce tuning costs and improve transfer speed while maintaining competitive accuracy across diverse domains.

Zero-shot hyperparameter transfer refers to a family of methodologies that enable the selection or configuration of hyperparameters for new tasks, datasets, model scales, or domains, without explicit search or tuning on the target instance. Instead, transfer occurs via principled statistical, meta-learning, or scale-aware mechanisms that leverage prior information—whether learned surrogates, scale invariance in the optimization landscape, or model-averaged ensembles—to deliver high-performing configurations in the absence of direct target supervision or target validation feedback.

1. Foundational Problem and Formal Definitions

The canonical hyperparameter optimization (HPO) objective is, for a given algorithm $A$, dataset $D$, and hyperparameter space $\Lambda$, to find $\lambda^* = \arg\min_{\lambda \in \Lambda} R(\lambda; D)$, where $R$ is the expected risk or held-out loss induced by $A$ trained on $D$ with configuration $\lambda$. Zero-shot hyperparameter transfer generalizes this by seeking, for a collection of prior datasets $\{D_1, \dots, D_K\}$ (or tasks, or model scales), either:

  • a static configuration $\lambda^*$, or more generally
  • a mapping $\lambda(D)$ from dataset/scale descriptors to $\Lambda$,

that minimizes the average or expected risk across new, unseen $D$: $\mathbb{E}_{D \sim \mathcal{P}}[R(\lambda(D); D)]$. Zero-shot protocols disallow any target-label or target-validation feedback; all information flows through prior hyperparameter-performance tuples, structured meta-data, or parameterization-induced invariance (e.g., scale-aware transfer) (Gijsbers et al., 2021, Winkelmolen et al., 2020, Yang et al., 2022, Schmidt et al., 2023, Ghosh et al., 28 Dec 2025, Meindl et al., 3 Oct 2025).
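As a concrete illustration of the static case, here is a minimal sketch that picks the configuration minimizing average risk over prior datasets; the `risk` table and config names are hypothetical, standing in for real held-out evaluations:

```python
# A minimal sketch of the static zero-shot objective: pick the single config
# minimizing average risk over prior datasets D_1..D_K, then deploy it on a
# new dataset unchanged. `risk` is a hypothetical precomputed table:
# risk[config] = [held-out risk on D_1, ..., held-out risk on D_K].
def static_zero_shot(risk):
    return min(risk, key=lambda cfg: sum(risk[cfg]) / len(risk[cfg]))

risk = {"cfg1": [0.4, 0.6, 0.5],
        "cfg2": [0.3, 0.35, 0.4],
        "cfg3": [0.9, 0.1, 0.6]}
print(static_zero_shot(risk))  # config with the lowest mean prior risk
```

The mapping variant $\lambda(D)$ would replace the fixed table lookup with a learned function of dataset descriptors, as in the surrogate methods of Section 2.2.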

2. Approaches and Algorithmic Schemes

Zero-shot hyperparameter transfer comprises several principal categories, each supported by rigorous empirical and theoretical analysis:

2.1 Scale-Aware (Parametrization-Based) Transfer

  • Maximal Update Parametrization (μP): μP reparameterizes neural networks so that both activations and per-step parameter updates remain $O(1)$ as width $m \to \infty$. Under μP, optimal hyperparameters (learning rates, optimizer coefficients, initialization scales) empirically and theoretically converge to scale-stable optima, enabling zero-shot transfer from a small proxy model to large-scale targets. This holds for Transformers, MLPs, and ResNets, and extends to operator regimes (Fourier Neural Operators) with modified scaling laws (Yang et al., 2022, Li et al., 24 Jun 2025, Ghosh et al., 28 Dec 2025).
  • Algorithmic Procedure:
    • Parametrize the large target model in μP.
    • Tune all μP-transferable hyperparameters on a reduced proxy (smaller width — a minimum of about 256 is recommended — batch size, sequence length, etc.).
    • Directly apply the same hyperparameters to the large-scale model; no further tuning.
  • Scaling Law for FNOs: For Fourier Neural Operators, the μTransfer-FNO scheme rescales the initialization variance and learning rate in proportion to $1/\sqrt{d \log K}$, where $K$ is the number of Fourier modes and $d$ is the input dimension (Li et al., 24 Jun 2025).
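The proxy-to-target procedure above can be sketched as a toy helper. The rescaling rule shown — hidden-weight Adam learning rate multiplied by base_width/target_width — is a common μP convention and an assumption here, not a formula stated in this article:

```python
# A minimal sketch of muP-style zero-shot transfer of the learning rate.
# Assumption: for matrix-like hidden weights under Adam, muP rescales the
# learning rate by base_width / target_width, so a value tuned on a narrow
# proxy can be reused at any target width without further tuning.
def mu_transfer_lr(base_lr, base_width, target_width):
    """Width-rescaled hidden-layer learning rate under the muP convention."""
    return base_lr * base_width / target_width

proxy_lr = 1e-3  # found by sweeping only on a width-256 proxy model
target_lr = mu_transfer_lr(proxy_lr, base_width=256, target_width=4096)
print(target_lr)
```

In practice, libraries implementing μP apply such rescalings per parameter group (embeddings, hidden matrices, output layers each follow different rules); this sketch covers only the hidden-matrix case.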

2.2 Surrogate, Meta-Learned, and Ensemble Methods

  • Meta-Learned Symbolic Defaults: Symbolic or parametric mappings from dataset meta-features (e.g., size, feature type counts, class imbalance) to HP configurations, learned over pools of prior problems using surrogate losses and grammar-based symbolic regression (Gijsbers et al., 2021, Winkelmolen et al., 2020).
  • Zero-shot AutoML and Meta-surrogates: Deep surrogates (typically MLPs) predict the rank or expected performance of any HP configuration given only simple meta-features of the target dataset (e.g., image resolution, class count), jointly trained on extensive meta-datasets of pipeline/dataset evaluations under a pairwise ranking loss (Öztürk et al., 2022).
  • Ensemble Surrogates for BO: Bayesian optimization methods aggregate posterior predictions from prior-task GPs (Gaussian Processes) via ranking-weighted, bootstrap-based mixtures, providing hyperparameter recommendations or acquisition functions that adaptively leverage past HP landscapes in zero-shot or few-shot regimes (Feurer et al., 2018).
  • Zero-shot Combinatorial Set Construction: Greedy submodular optimization on large meta-tables of loss evaluations across many datasets and configurations, yielding small lists of zero-shot default HPs with the guarantee that one performs near-optimally on any new task (Winkelmolen et al., 2020).
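The combinatorial set construction above can be sketched greedily. The meta-table and toy values below are hypothetical; each step adds the config that most reduces the average per-dataset minimum loss (a coverage-style, monotone objective amenable to greedy optimization):

```python
# A minimal sketch of greedy zero-shot default-list construction, assuming a
# precomputed meta-table `losses[config] = [loss on D_1, ..., loss on D_K]`
# gathered from prior tasks.
def greedy_default_list(losses, k):
    remaining = list(losses)
    n_datasets = len(next(iter(losses.values())))
    best_so_far = [float("inf")] * n_datasets  # best loss per dataset so far
    chosen = []
    for _ in range(min(k, len(remaining))):
        # Pick the config minimizing the summed per-dataset minimum loss.
        pick = min(remaining,
                   key=lambda cfg: sum(min(b, l) for b, l
                                       in zip(best_so_far, losses[cfg])))
        chosen.append(pick)
        best_so_far = [min(b, l) for b, l in zip(best_so_far, losses[pick])]
        remaining.remove(pick)
    return chosen

# Toy meta-table: three configs evaluated on two prior datasets.
table = {"a": [0.3, 0.9], "b": [0.8, 0.2], "c": [0.5, 0.5]}
print(greedy_default_list(table, 2))
```

The resulting short list is then shipped as-is: on a new task, all $k$ defaults are run and the best is kept, with the greedy construction providing the near-optimality guarantee cited above.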

2.3 Model-Averaging and Ensembling for Robust Transfer

  • Accumulative Model Averaging: For multilingual or cross-domain transfer, run-by-run model averaging across distinct hyperparameter realizations (rather than selection by source-dev or target-dev validation) yields superior zero-shot cross-lingual transfer (ZS-XLT), outperforming both single-run and “model soup” baselines even without target labels (Schmidt et al., 2023).
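A sketch of the averaging step, assuming each run's fine-tuned parameters are available as a name-to-vector dict (uniform averaging across runs, with no dev-set selection):

```python
# A minimal sketch of run-by-run (accumulative) model averaging: instead of
# selecting one run by source-dev or target-dev score, average the parameters
# of all runs (each trained under a different hyperparameter realization).
def average_runs(runs):
    n = len(runs)
    avg = {name: [0.0] * len(vals) for name, vals in runs[0].items()}
    for run in runs:
        for name, vals in run.items():
            for i, v in enumerate(vals):
                avg[name][i] += v / n
    return avg

run_a = {"w": [1.0, 2.0]}  # toy stand-ins for full checkpoints
run_b = {"w": [3.0, 4.0]}
print(average_runs([run_a, run_b]))
```

Real checkpoints would use tensor arithmetic rather than nested lists, but the operation is the same element-wise mean over runs.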

3. Theoretical Guarantees and Transfer Conditions

  • Fast Transfer and Scale Invariance: The formal notion of fast HP transfer states that the suboptimality penalty of transferred HPs satisfies $\Delta_n = o(G_n)$, where $G_n$ is the inherent finite-model performance gap. This property holds under mild convexity when the optimal HP itself converges to its infinite-scale counterpart at rate $b_n = o(a_n^{1/2})$, with $a_n = G_n$ (Ghosh et al., 28 Dec 2025). Fast transfer yields compute-optimal tuning: transferred HP search outperforms direct search at large scale when this criterion holds.
  • Synthetic Positive and Negative Examples: Positive cases (ridge regression, μP-MLE) show both empirical and theoretical verification of fast transfer, while counterexamples (e.g., two-layer ReLU “ball indicator” with high nonlinearity) break this condition (Ghosh et al., 28 Dec 2025).
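In display form — where $R_n$ for the scale-$n$ risk and $\hat{\lambda}$ for the transferred hyperparameter are notation introduced here for illustration, not taken from the cited work:

```latex
% Fast transfer: the penalty of reusing the proxy-tuned hyperparameter
% \hat{\lambda} is negligible relative to the finite-scale gap G_n.
\Delta_n \;=\; R_n(\hat{\lambda}) \;-\; \min_{\lambda \in \Lambda} R_n(\lambda) \;=\; o(G_n),
\qquad \text{guaranteed under mild convexity when } b_n = o\!\left(a_n^{1/2}\right),\; a_n = G_n,
```

with $b_n$ the rate at which the finite-scale optimal HP approaches its infinite-scale limit.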

4. Empirical Benchmarks and Comparative Results

Zero-shot transfer has been validated across diverse modalities, scales, and experimental setups:

| Domain | Methodology | Target Metric | Zero-Shot Transfer Performance | Reference |
|---|---|---|---|---|
| NLP (BERT, GPT-3, XLM-R) | μTransfer (μP) | GLUE, BLEU, F1 | Outperforms direct large-scale tuning | (Yang et al., 2022) |
| PDEs (FNO, PINO) | μTransfer-FNO | L² error | ≤ direct tuning, 70% tuning-cost reduction | (Li et al., 24 Jun 2025) |
| Cross-lingual NLU | Accumulative model averaging | Accuracy/F1 | +0.2–5.0 pp over source-dev selection | (Schmidt et al., 2023) |
| Deep learning (vision) | Zero-shot AutoML (ZAP) | Area under learning curve (ALC) | +0.09 ALC over best challenge baseline | (Öztürk et al., 2022) |
| Meta-ML (tabular) | Symbolic defaults | Log-loss, rank | Matches 8–16× random search | (Gijsbers et al., 2021) |
| Black-box opt. (HPO-Bench) | ZeroShotOpt (offline RL) | Normalized regret | 0.885 ± 0.009 (BO: 0.900 ± 0.007) | (Meindl et al., 3 Oct 2025) |
| BO, various UCI/Kaggle tasks | Ensemble surrogate (RGPE) | Regret | 2–5× speedup over vanilla BO | (Feurer et al., 2018) |

Zero-shot methods outperform baselines that select static “best” configurations, random selection, or challenge-winning AutoML systems. For large-scale NNs and PDEs, scale-aware zero-shot transfer eliminates the need for wide HP sweeps, and for black-box HPO, trained RL policies generalize to unseen optimization landscapes without additional tuning.

5. Limitations and Caveats

  • Domain and Task Specificity: Surrogate- and meta-feature-based zero-shot transfer is limited by the coverage and representativeness of the meta-dataset; performance may degrade on substantial domain shifts (Öztürk et al., 2022, Gijsbers et al., 2021).
  • Regularizer and Conditional HPs: Some hyperparameters, such as dropout, weight decay, or domain-specialized regularizers, are not reliably transferable and require re-tuning post-transfer even in μP-based protocols (Yang et al., 2022, Li et al., 24 Jun 2025).
  • Model Dynamics and Decomposition: Fast transfer is empirically contingent on optimization dynamics concentrating loss reduction in a width-stable, low-rank subspace. For complex tasks or pathological loss surfaces, transferred optima can drift, and transfer does not outperform direct tuning (Ghosh et al., 28 Dec 2025).

6. Best Practices, Practical Recipes, and Community Artifacts

  • μP/Scale-Aware Transfer: Parametrize models in μP and tune transferable HPs only on proxies with the minimum recommended width and batch size; empirically verify transfer by loss-curve invariance and top-$k$ subspace inspection (Yang et al., 2022, Ghosh et al., 28 Dec 2025).
  • Meta-learned Surrogates and Symbolic Defaults: Deploy surrogates mapping trivial meta-features (4–8 dimensions) to HP configurations, learned via ranking loss or evolutionary search on large prior meta-datasets (Gijsbers et al., 2021, Öztürk et al., 2022).
  • Zero-shot HPO Lists: Practitioners can utilize precomputed lists of $K$ default configs (lookup tables published with method results), executing all in parallel and selecting the best by target validation (Winkelmolen et al., 2020).
  • Model Averaging for Cross-Lingual/NLP: Average $r = 5$–$10$ checkpoints from a diverse HP grid, yielding robust transfer in zero-shot cross-lingual scenarios (Schmidt et al., 2023).
  • RL-Based Optimizers: Employ offline-pretrained transformer-based optimizers such as ZeroShotOpt for continuous black-box HPO without acquisition-function configuration or kernel selection (Meindl et al., 3 Oct 2025).
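The list-based recipe above can be sketched as a select-best loop; `train_and_validate` is a hypothetical user-supplied callable (here replaced by a toy stand-in) returning a target validation loss for a config:

```python
import math

# A minimal sketch of the zero-shot HPO-list recipe: evaluate every config
# from a small precomputed default list (in practice, in parallel) and keep
# the one with the best target validation score.
def best_default(defaults, train_and_validate):
    scored = [(train_and_validate(cfg), cfg) for cfg in defaults]
    return min(scored, key=lambda pair: pair[0])[1]

defaults = [{"lr": 1e-2}, {"lr": 1e-3}, {"lr": 1e-4}]
# Toy stand-in for real training: pretend lr = 1e-3 gives the lowest loss.
toy_loss = lambda cfg: abs(math.log10(cfg["lr"]) + 3)
print(best_default(defaults, toy_loss))
```

Note that this final selection step uses target validation once over a fixed short list; the zero-shot component is the construction of the list itself, which required no feedback from the target task.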

7. Outlook and Research Directions

Zero-shot hyperparameter transfer continues to expand in several directions: parameter-efficient transfer across domain-shifted or mixed-modality tasks, automated scaling laws for emerging model families, extension to mixed and categorical HP spaces (via hybrid models or discrete embedding approaches), and algorithmic pipelines that blend zero-shot and few-shot regimes via online adaptation of meta-surrogates. Cross-modal and multi-objective transfer remains a challenging and active area. The rapid growth of published community artifact tables, symbolic default grammars, and pretrained optimization policies establishes a foundation for both theory-driven and empirical advances in this domain.
