
Cross-Dataset Hyperparameter Transfer

Updated 30 December 2025
  • Representative work demonstrates that leveraging ensemble surrogate models and Bayesian optimization can cut target evaluations by up to 5×.
  • Cross-dataset hyperparameter transfer is defined as reusing empirical hyperparameter mappings across datasets to boost search efficiency and model performance.
  • Practical methods include meta-feature extraction, surrogate alignment, and portfolio selection to robustly address multi-source covariate shift and continual learning.

Cross-dataset hyperparameter transfer denotes the practice of leveraging hyperparameter optimization or search results from related datasets to accelerate, warm-start, or otherwise improve hyperparameter selection on a new dataset. This paradigm is fundamental across transfer learning, AutoML, and meta-learning, wherein prior computational effort or empirical knowledge is adaptively reused or encoded. Research spans principled transfer in Bayesian optimization, neural/surrogate-based alignment, copula models, kernel embeddings, meta-feature-driven surrogates, combinatorial portfolio selection, multi-source covariate shift, and ordered optimization in continual settings.

1. Problem Formulation and Conceptual Foundations

Cross-dataset hyperparameter transfer is rigorously defined in terms of transferring knowledge (surrogate models, priors, optimal configurations, meta-features, or empirical mappings) from a set of “source” tasks (datasets) to a new “target” task. The core problem is to estimate, for the target dataset $D^t$, the mapping from hyperparameter configuration $x \in \mathcal{X}$ to objective value $y = f^t(x)$, using available data $\{(x^{s}_i, y^{s}_i)\}_{i,s}$ from other datasets.

A simple formalization, as in ensemble Bayesian optimization (Feurer et al., 2018), is:

  • Given $T$ source tasks, each with evaluations $\{\mathcal{D}^{(s)}\}_{s=1}^T$, learn a surrogate or set of priors $M^s$.
  • For a new target, combine $M^s$ with an online-updated $M^t$ to form transfer recommendations or surrogates driving acquisition functions (a minimal combination sketch follows below).
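
As a rough illustration of this combination step (not the exact RGPE formulation of Feurer et al., 2018), the sketch below fits one Gaussian process per task and blends their predictions with task weights; the data, kernel, and uniform weights are placeholders.

```python
# Minimal sketch: weighted ensemble of per-task GP surrogates (placeholder data).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Source-task evaluations {(x_i^s, y_i^s)} and a few target observations.
source_data = [(rng.uniform(0, 1, (20, 2)), rng.normal(size=20)) for _ in range(3)]
X_target = rng.uniform(0, 1, (5, 2))
y_target = rng.normal(size=5)

# Fit one GP surrogate per source task and one on the target.
def fit_gp(X, y):
    return GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

source_models = [fit_gp(X, y) for X, y in source_data]
target_model = fit_gp(X_target, y_target)

# Combine predictions with task weights (uniform here; RGPE would derive them
# from ranking agreement with the target observations).
models = source_models + [target_model]
weights = np.ones(len(models)) / len(models)

def ensemble_predict(X_candidates):
    means = np.stack([m.predict(X_candidates) for m in models])
    return weights @ means

print(ensemble_predict(rng.uniform(0, 1, (4, 2))))
```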

Within multi-objective or multi-fidelity settings (cf. Terragni et al., 2022; Winkelmolen et al., 2020), the objective may further include coherence, diversity, or multi-task performance metrics, and transfer solutions may be structured as portfolios, embedding-based surrogates, or ensembles.

2. Bayesian Optimization and Surrogate-Based Transfer

Bayesian optimization (BO) underpins much of the technical literature (e.g., Feurer et al., 2018; Law et al., 2018; Li et al., 2022; Salinas et al., 2019; Hellan et al., 2023). Transfer is enacted by one or more of the following:

  • Constructing ensemble surrogates, such as the ranking-weighted Gaussian process ensemble (RGPE) (Feurer et al., 2018), which aggregates GP predictions from each source task weighted by performance on the target.
  • Building kernel-augmented GPs, e.g., embedding datasets into an RKHS via mean-embedding representations and conditioning the BO surrogate on both hyperparameters and the dataset embedding (Law et al., 2018), or using learned meta-feature extractors (Jomaa et al., 2021).
  • Implementing parametric or semi-parametric copula models mapping source empirical quantiles to a latent normal, allowing robust pooling across tasks of different scales and variances (Salinas et al., 2019).

For task-weighting, RGPE uses ranking statistics and bootstrap aggregation, while TransBO optimizes source and target combination weights jointly through supervised ranking losses and cross-validation (Li et al., 2022).
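
The snippet below illustrates, in a deliberately simplified form, how ranking-based weights might be derived: each source surrogate is scored by the fraction of target observation pairs whose ordering it predicts correctly. RGPE's probabilistic ranking loss and bootstrap aggregation, and TransBO's supervised weight optimization, are omitted.

```python
# Simplified ranking-agreement weights for source surrogates (illustrative only).
import numpy as np

def ranking_agreement(y_pred, y_true):
    """Fraction of target observation pairs whose ordering the surrogate preserves."""
    n = len(y_true)
    agree, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total += 1
            if (y_pred[i] < y_pred[j]) == (y_true[i] < y_true[j]):
                agree += 1
    return agree / max(total, 1)

def task_weights(source_models, X_target, y_target):
    """Normalize per-task ranking agreement into ensemble weights."""
    scores = np.array([
        ranking_agreement(m.predict(X_target), y_target) for m in source_models
    ])
    return scores / scores.sum() if scores.sum() > 0 else np.ones_like(scores) / len(scores)
```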

A general BO loop incorporating transfer typically initializes with transferred priors or configurations, fits joint or ensemble surrogates, proposes new hyperparameter candidates by maximizing an acquisition function (e.g., Expected Improvement), and updates the models with results evaluated on the target (Feurer et al., 2018).
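
A minimal sketch of such a loop, assuming a single GP surrogate on the target, warm-start configurations transferred from source tasks, and Expected Improvement maximized over random candidates; the objective function and search space are placeholders.

```python
# Sketch of a transfer-aware BO loop: warm start, fit surrogate, maximize EI.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)
f_target = lambda x: np.sum((x - 0.3) ** 2)                       # placeholder objective
warm_starts = [np.array([0.25, 0.25]), np.array([0.7, 0.4])]      # e.g., best source configs

X_obs = list(warm_starts)
y_obs = [f_target(x) for x in X_obs]

def expected_improvement(mu, sigma, best):
    sigma = np.maximum(sigma, 1e-9)
    z = (best - mu) / sigma
    return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

for _ in range(10):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(np.array(X_obs), np.array(y_obs))
    candidates = rng.uniform(0, 1, (256, 2))
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, min(y_obs)))]
    X_obs.append(x_next)
    y_obs.append(f_target(x_next))

print("best value found:", min(y_obs))
```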

3. Surrogate Alignment and Nonparametric Mapping Approaches

Methods such as surrogate alignment (HTS) (Ilievski et al., 2016) learn a direct nonlinear mapping $g:\mathcal{X}\to\mathcal{X}$ from source to target hyperparameter optima. HTS leverages surrogate models (radial-basis function regressors with polynomial tail) of the error landscapes from source and target, and trains a neural network to align them by minimizing a rank-correlation loss between surrogate predictions. This method is effective for DNNs, where exact modeling is impractical due to high training cost, and does not require meta-features: only $(x, f(x))$ pairs from both domains are needed.
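
The sketch below captures the alignment idea only loosely: cheap RBF surrogates of the two error landscapes and a simple affine map scored by Spearman rank correlation stand in for HTS's RBF-with-polynomial-tail surrogates and neural-network mapping; all data is placeholder.

```python
# Illustrative surrogate-alignment sketch: fit surrogates of the source and target
# error landscapes, then fit an affine map g maximizing rank agreement between
# the target landscape at x and the source landscape at g(x).
import numpy as np
from scipy.interpolate import RBFInterpolator
from scipy.optimize import minimize
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Placeholder (x, f(x)) pairs from the source and target tasks.
X_src, y_src = rng.uniform(0, 1, (40, 2)), rng.normal(size=40)
X_tgt, y_tgt = rng.uniform(0, 1, (15, 2)), rng.normal(size=15)

src_surrogate = RBFInterpolator(X_src, y_src, kernel="thin_plate_spline")
tgt_surrogate = RBFInterpolator(X_tgt, y_tgt, kernel="thin_plate_spline")

probe = rng.uniform(0, 1, (64, 2))   # points at which the two landscapes are compared

def neg_rank_agreement(params):
    scale, shift = params[:2], params[2:]
    mapped = np.clip(probe * scale + shift, 0, 1)
    rho, _ = spearmanr(tgt_surrogate(probe), src_surrogate(mapped))
    return -rho

result = minimize(neg_rank_agreement, x0=np.array([1.0, 1.0, 0.0, 0.0]),
                  method="Nelder-Mead")
print("rank correlation after alignment:", -result.fun)
```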

Experiments confirm substantial reductions (3–5×) in required target evaluations compared to non-transfer surrogates (e.g., HORD).

4. Meta-Feature and Kernel Embedding Techniques

Recent approaches employ learned dataset representations or meta-features to enable cross-dataset transfer even across heterogeneous domains (Jomaa et al., 2021; Law et al., 2018). In DMFBS (Jomaa et al., 2021), a differentiable deep-set architecture extracts an embedding from each dataset; the embeddings are optimized jointly for response regression, a manifold-regularization term that encourages similar surrogate predictions for similar datasets, and an auxiliary dataset-identification task. The resulting embedding is concatenated with the hyperparameter configuration to condition the surrogate predictor, which drives acquisition and candidate ranking on new datasets.
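
A minimal deep-set style sketch (PyTorch) of this conditioning mechanism: a permutation-invariant encoder produces a dataset embedding that is concatenated with a hyperparameter configuration before the surrogate head. DMFBS's manifold regularization and dataset-identification losses are not shown, and all layer sizes are arbitrary.

```python
# Minimal deep-set dataset encoder conditioning a surrogate on (embedding, hyperparams).
import torch
import torch.nn as nn

class DatasetEncoder(nn.Module):
    def __init__(self, n_features, embed_dim=16):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(n_features, 32), nn.ReLU(), nn.Linear(32, 32))
        self.rho = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Linear(32, embed_dim))

    def forward(self, instances):                  # instances: (n_rows, n_features)
        pooled = self.phi(instances).mean(dim=0)   # permutation-invariant mean pooling
        return self.rho(pooled)                    # dataset embedding

class Surrogate(nn.Module):
    def __init__(self, embed_dim, n_hparams):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(embed_dim + n_hparams, 64),
                                  nn.ReLU(), nn.Linear(64, 1))

    def forward(self, embedding, hparams):
        return self.head(torch.cat([embedding, hparams], dim=-1))

encoder, surrogate = DatasetEncoder(n_features=8), Surrogate(embed_dim=16, n_hparams=3)
dataset = torch.randn(100, 8)            # placeholder tabular dataset
hparams = torch.tensor([0.1, 0.5, 3.0])  # placeholder configuration
print(surrogate(encoder(dataset), hparams).item())
```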

Distributional transfer (Law et al., 2018) utilizes kernel mean embedding of the distributions $P_{X,Y}$ into an RKHS, and places a GP prior over $(\theta, \psi(D), s)$, facilitating transfer by relating hyperparameter performance across similar datasets in feature space.
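
As a rough analogue (not the exact construction of Law et al., 2018), a kernel mean embedding can be approximated with random Fourier features averaged over a dataset's rows and concatenated with hyperparameters as GP inputs; data and dimensions below are placeholders.

```python
# Approximate kernel mean embedding via random Fourier features, used as meta-features.
import numpy as np
from sklearn.kernel_approximation import RBFSampler
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
feature_map = RBFSampler(gamma=1.0, n_components=50, random_state=0).fit(np.zeros((1, 4)))

def mean_embedding(data):
    """Average random-feature map over the rows of a dataset (approximate kernel mean)."""
    return feature_map.transform(data).mean(axis=0)

# Placeholder meta-data: (dataset, hyperparameter, observed score) triples from source tasks.
datasets = [rng.normal(size=(60, 4)) for _ in range(5)]
hparams = rng.uniform(0, 1, (5, 2))
scores = rng.normal(size=5)

inputs = np.array([np.concatenate([h, mean_embedding(d)])
                   for h, d in zip(hparams, datasets)])
gp = GaussianProcessRegressor().fit(inputs, scores)   # surrogate over (theta, psi(D))
```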

5. Portfolio Selection and Zero-Shot Cross-Dataset Transfer

Zero-shot HPO approaches (Winkelmolen et al., 2020; Rijn et al., 2017; Terragni et al., 2022) demonstrate that a small portfolio of hyperparameter configurations can "cover" a large set of future datasets: for any unseen dataset, at least one configuration performs near-optimally.

Portfolio selection algorithms solve a combinatorial, submodular minimization of mean regret over meta-datasets, employing greedy augmentation together with either surrogate modeling or multi-fidelity evaluations. The result is a lookup table of recommended default configurations.
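
A bare-bones version of the greedy step over a precomputed meta-dataset of losses (configurations × datasets) is sketched below; regret normalization, surrogate-based completion of the meta-table, and multi-fidelity evaluation are omitted.

```python
# Greedy portfolio selection over a meta-dataset of losses (rows: configs, cols: datasets).
import numpy as np

def greedy_portfolio(loss_table, k):
    """Pick k configs minimizing the mean (over datasets) of the best loss in the portfolio."""
    n_configs, n_datasets = loss_table.shape
    chosen, best_so_far = [], np.full(n_datasets, np.inf)
    for _ in range(k):
        # Marginal value of each remaining config: mean loss if it were added to the portfolio.
        scores = [np.minimum(best_so_far, loss_table[c]).mean()
                  if c not in chosen else np.inf for c in range(n_configs)]
        c_star = int(np.argmin(scores))
        chosen.append(c_star)
        best_so_far = np.minimum(best_so_far, loss_table[c_star])
    return chosen

rng = np.random.default_rng(0)
meta_losses = rng.uniform(size=(200, 40))   # placeholder meta-dataset
print(greedy_portfolio(meta_losses, k=5))
```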

Table: Portfolio Construction Summary

| Method | Portfolio Construction | Evaluation Approach |
|---|---|---|
| Greedy Submodular (Winkelmolen et al., 2020) | Greedy K-set minimizing mean meta-loss | Direct empirical or surrogate-based |
| Surrogate Adaptive Query (Winkelmolen et al., 2020) | Surrogate model learned over $(d,\theta)$ | Bayesian optimization on meta-table |
| Multi-Objective BO (Terragni et al., 2022) | Pareto front for coherence/diversity/classification | Multi-output random scalarization |

Empirical evidence attests that such portfolios, constructed from hundreds of datasets and tens of thousands of configurations, allow practitioners to skip or dramatically reduce hyperparameter search on new datasets by evaluating only the top-K recommended options.

6. Task and Dataset Similarity: Ordered and Distributional Transfer

OTHPO (Hellan et al., 2023) introduces ordered transfer for sequential tasks (e.g., increasing data sizes, continual learning), positing that more recent tasks are more strongly correlated with the target. The approach models transfer via GP surrogates over the joint space of hyperparameters and an ordered context variable (e.g., time, index, data fraction). Warm-start heuristics select the best configurations from immediately prior tasks, yielding improved "first-evaluation" regret.
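
A schematic rendering of the idea, assuming a plain GP over (hyperparameters, task index) and a warm start taken from the immediately preceding task; OTHPO's specific kernel and modeling choices over the ordered context are simplified away, and all data is placeholder.

```python
# Sketch of ordered transfer: GP over (hyperparams, task index), warm start from previous task.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

rng = np.random.default_rng(0)

# Placeholder history: per-task (config, loss) evaluations for tasks t = 0..2.
history = {t: (rng.uniform(0, 1, (15, 2)), rng.normal(size=15)) for t in range(3)}

# Stack (x, t) as joint GP inputs so correlation can decay with task distance.
X_joint = np.vstack([np.hstack([X, np.full((len(X), 1), t)]) for t, (X, _) in history.items()])
y_joint = np.concatenate([y for _, y in history.values()])
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X_joint, y_joint)

# Warm start for the new task t = 3: best configuration observed on task t = 2.
X_prev, y_prev = history[2]
x_warm = X_prev[np.argmin(y_prev)]

# Score candidates for the new task with the joint surrogate at context t = 3.
candidates = rng.uniform(0, 1, (128, 2))
mu = gp.predict(np.hstack([candidates, np.full((len(candidates), 1), 3)]))
print("warm start:", x_warm, "best predicted candidate:", candidates[np.argmin(mu)])
```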

Distributional embeddings or meta-feature kernels generalize this approach for unordered, meta-feature-equipped cross-dataset scenarios (Law et al., 2018; Jomaa et al., 2021; Terragni et al., 2022).

7. Practical Guidelines and Empirical Impact

Effective cross-dataset transfer depends on principled weighting of sources, careful model aggregation, attention to distributional shift, and adaptive regularization:

  • Always validate transferred configurations or priors on a held-out split to account for outliers, semantic shift, or domain distance (Dube et al., 2018).
  • Graduated (layer-wise) hyperparameter schedules and regression-based prediction of transfer scales yield superior results in deep networks (Dube et al., 2018).
  • Embedding-based or copula-normalized surrogates avoid pitfalls from raw value pooling across tasks with disparate objective scales (Salinas et al., 2019); a minimal copula-transform sketch follows after this list.
  • Variance-minimizing importance weighting is critical under multi-source covariate shift (Nomura et al., 2020).
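
For example, the Gaussian-copula style transform behind such normalization maps each task's observed objective values through their empirical CDF to standard-normal quantiles, making tasks with different scales comparable; the snippet below sketches only that transform, with made-up values.

```python
# Gaussian copula transform: map per-task objective values to standard-normal quantiles
# via their empirical CDF, so tasks on different scales can be pooled (illustrative).
import numpy as np
from scipy.stats import norm, rankdata

def copula_transform(y):
    """Empirical CDF ranks mapped through the standard-normal quantile function."""
    n = len(y)
    ecdf = rankdata(y) / (n + 1)   # ranks scaled into the open interval (0, 1)
    return norm.ppf(ecdf)

task_a = np.array([0.91, 0.88, 0.95, 0.80])     # e.g., accuracies
task_b = np.array([120.0, 95.0, 210.0, 50.0])   # e.g., losses on a different scale
print(copula_transform(task_a))
print(copula_transform(task_b))
```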

Empirical studies consistently report 2–10× reductions in search time, substantial improvements in accuracy or regret over random search, and resilience to outlier tasks and distributional noise.

Cross-dataset hyperparameter transfer remains a cornerstone enabling scalable, sample-efficient, and adaptive automated model tuning in contemporary machine learning pipelines.
