Relative Overtuning
- Relative overtuning is a normalized metric that measures how much of the potential test-set improvement from hyperparameter optimization is lost when noisy validation scores mislead the selection of the best configuration.
- Empirical studies show that overtuning is prevalent, especially on small datasets and with simple holdout validation or high-variance metrics, and is severe in about 10% of hyperparameter optimization runs.
- Mitigation involves using sophisticated resampling like repeated cross-validation, increasing dataset size when possible, selecting robust performance metrics such as log loss, and employing robust hyperparameter optimization techniques.
Relative overtuning is a normalized metric that quantifies the degree of overfitting occurring during hyperparameter optimization (HPO) when validation error estimates mislead the search towards suboptimal generalization performance. Unlike overfitting at the level of model training, overtuning refers specifically to the HPO process: it captures the extent to which an HPO procedure forfeits potential improvement by selecting a hyperparameter configuration (HPC) that appears optimal on validation data but performs worse on the test set than previously evaluated candidates. Normalization makes these effects directly comparable across different tasks, datasets, and evaluation metrics.
1. Formal Definition
Let $\lambda_1, \ldots, \lambda_T$ denote the sequence of HPCs evaluated by HPO, and $\lambda^*_t$ the incumbent at step $t$, i.e., the HPC with the best validation error so far: $\lambda^*_t := \arg\min_{\lambda \in \{\lambda_1, \ldots, \lambda_t\}} \widehat{\mathrm{val}}(\lambda)$, where $\widehat{\mathrm{val}}(\lambda)$ denotes the resampled validation error estimate (e.g., from holdout or cross-validation).
The true generalization error (test error) of $\lambda^*_t$ is denoted $\mathrm{test}(\lambda^*_t)$. The (absolute) overtuning at step $t$ is: $\mathrm{ot}_t(\lambda_1, \ldots, \lambda_T) = \mathrm{test}(\lambda^*_t) - \min_{\lambda^*_{t'} \in \{\lambda^*_1, \ldots, \lambda^*_t\}} \mathrm{test}(\lambda^*_{t'})$. Relative overtuning is the normalized version: $\widetilde{\mathrm{ot}}_t(\lambda_1, \ldots, \lambda_T) = \frac{ \mathrm{ot}_t(\lambda_1, \ldots, \lambda_T) }{ \mathrm{test}(\lambda^*_1) - \min_{\lambda^*_{t'} \in \{\lambda^*_1, \ldots, \lambda^*_t\}} \mathrm{test}(\lambda^*_{t'}) }$
- A value of $0$ indicates no overtuning (the selected HPC generalizes as well as the best seen so far).
- A value of $0.1$ means that 10% of the possible HPO gains have been lost to overtuning.
- A value of $1$ or more indicates that all of the progress achieved by HPO (or more) has been negated.
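The definitions above can be illustrated with a short Python sketch. The function below is a minimal, hypothetical implementation of the formulas, not taken from any reference codebase; the toy error values at the end are invented for illustration.

```python
import numpy as np

def relative_overtuning(val_errors, test_errors):
    """Relative overtuning at every step of an HPO trajectory.

    val_errors[i]  -- resampled validation error estimate of the i-th evaluated HPC
    test_errors[i] -- test error of the i-th evaluated HPC
    """
    val_errors = np.asarray(val_errors, dtype=float)
    test_errors = np.asarray(test_errors, dtype=float)
    rel_ot = np.zeros(len(val_errors))
    for t in range(len(val_errors)):
        # incumbent at step k: argmin of validation error over the first k+1 HPCs
        incumbents = [int(np.argmin(val_errors[: k + 1])) for k in range(t + 1)]
        test_inc = test_errors[incumbents]
        # absolute overtuning: current incumbent vs. best incumbent so far (on test)
        ot = test_inc[-1] - test_inc.min()
        # normalize by the total test improvement over the first incumbent;
        # the degenerate case of a zero denominator is mapped to 0 here
        denom = test_inc[0] - test_inc.min()
        rel_ot[t] = ot / denom if denom > 0 else 0.0
    return rel_ot

# Toy trajectory: validation noise makes the last HPC look best,
# although the second one generalizes better.
val = [0.30, 0.25, 0.24, 0.20]
test = [0.31, 0.24, 0.26, 0.28]
print(relative_overtuning(val, test))  # final value > 0 signals overtuning
```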
2. Prevalence and Severity in Empirical Studies
Large-scale reanalyses of benchmark HPO datasets (FCNet, LCBench, WDTB, TabZilla, TabRepo, reshuffling, PD1) display the following patterns:
- ~60% of HPO runs show zero overtuning ($\widetilde{\mathrm{ot}}_t = 0$).
- ~70% show at most mild overtuning.
- ~90% have $\widetilde{\mathrm{ot}}_t < 1$, i.e., HPO progress is not entirely lost.
- However, in ~10% of runs, severe overtuning ($\widetilde{\mathrm{ot}}_t \geq 1$) is observed, where HPO selects an HPC with worse test error than the default or first-tried configuration.
The prevalence and severity display strong heterogeneity:
- Negligible overtuning in FCNet.
- Overtuning in more than 50% of runs for TabRepo and reshuffling benchmarks, with >15% severe overtuning.
- Small datasets, simple holdout validation, and high-variance metrics are associated with higher overtuning rates.
3. Determinants and Risk Factors
Multiple factors influence the likelihood and magnitude of relative overtuning:
a) Performance Metric:
- Accuracy and ROC AUC, with higher selection variance, are more susceptible.
- Log loss is more robust against overtuning.
- Multiclass tasks with many classes experience less overtuning, consistent with theoretical expectations that overfitting from repeated reuse of validation data is weaker for higher-cardinality targets.
b) Resampling Strategy:
- Holdout validation is associated with increased overtuning, especially in small datasets.
- Repeated cross-validation (e.g., 5×5-fold CV) substantially mitigates overtuning.
- Reshuffling resampling splits during HPO can help reduce the effect.
c) Dataset Size:
- Smaller datasets suffer markedly higher overtuning rates.
- Increasing the training sample size reduces both the frequency and severity of overtuning.
d) Learning Algorithm:
- Elastic Net is comparatively robust.
- Highly flexible models such as CatBoost, Funnel MLP, XGBoost, and neural networks are more prone to overtuning under data scarcity or noisy validation.
e) HPO Method:
- Random Search increases overtuning, particularly under noisy evaluation.
- Bayesian Optimization (e.g., HEBO, SMAC3) slightly raises the odds of any overtuning but reduces the severity versus Random Search, likely due to its surrogate noise modeling.
Regression modeling confirms that repeated CV almost halves overtuning risk and effect size, especially on small data. Longer HPO runs increase overtuning to a point, after which effects plateau.
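These effects can be reproduced qualitatively with a toy simulation (not drawn from the benchmarks above): the validation estimate of each HPC is modeled as its true test error plus Gaussian noise, and a random search selects incumbents by validation error. All distributions and constants below are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_final_relative_overtuning(n_configs, noise_sd, n_runs=2000):
    """Average relative overtuning after a simulated random search.

    n_configs -- number of HPCs evaluated (search length)
    noise_sd  -- standard deviation of the validation noise
    """
    results = []
    for _ in range(n_runs):
        test = rng.uniform(0.20, 0.40, size=n_configs)           # true test errors
        val = test + rng.normal(0.0, noise_sd, size=n_configs)   # noisy validation estimates
        incumbents = [int(np.argmin(val[: k + 1])) for k in range(n_configs)]
        test_inc = test[incumbents]
        ot = test_inc[-1] - test_inc.min()
        denom = test_inc[0] - test_inc.min()
        results.append(ot / denom if denom > 0 else 0.0)
    return float(np.mean(results))

# Noisier validation estimates and longer searches increase average overtuning.
for noise_sd in (0.01, 0.05):
    for n_configs in (10, 50, 250):
        print(noise_sd, n_configs, round(mean_final_relative_overtuning(n_configs, noise_sd), 3))
```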
4. Mitigation Approaches
Several practical strategies can limit overtuning, especially in the small-data regime:
- Sophisticated Resampling: Prefer repeated cross-validation over simple holdout (see the sketch at the end of this section).
- Larger Training Sets: Where possible, increasing dataset size reduces risk.
- Metric Selection: Use robust metrics such as log loss.
- Robust HPO Techniques: Bayesian Optimization with regularization and noise awareness outperforms Random Search with respect to overtuning.
- Dynamic Resampling Splits: Periodically reshuffling partitioning during HPO helps avoid over-optimization to a specific validation split.
- Early Stopping: Detect overtuning early using robust estimators and halt HPO if overtuning is indicated, though gains are incremental if overtuning is already mild.
- Robust Incumbent Selection: Selection based on robust criteria or a dedicated selection set can help, but overly conservative approaches (e.g., holding out large selection sets) may impair performance due to less training data.
- Adaptive Resampling/Racing: Dynamically allocate more evaluation budget to promising HPCs and eliminate poor ones early.
Implementation of such methods may raise computational demands.
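As a concrete illustration of the resampling and metric recommendations above, the sketch below combines repeated cross-validation with log loss as the selection metric using scikit-learn. The model, search space, and synthetic dataset are illustrative placeholders, not prescriptions from the studies discussed here.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
from sklearn.model_selection import (RandomizedSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)

# Small synthetic dataset to mimic the small-data regime where overtuning is most likely.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=5000),
    param_distributions={"C": loguniform(1e-3, 1e3)},   # illustrative search space
    n_iter=50,
    scoring="neg_log_loss",                             # log loss: lower-variance selection metric
    cv=RepeatedStratifiedKFold(n_splits=5, n_repeats=5, random_state=0),  # 5x5-fold CV
    random_state=0,
)
search.fit(X_train, y_train)

# Report untouched test performance after HPO rather than only validation scores.
print(search.best_params_, log_loss(y_test, search.predict_proba(X_test)))
```

The repeated 5×5-fold scheme averages out split-specific noise in the validation estimate, which is exactly the noise that drives misranking and overtuning.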
5. Relationship to Meta-Overfitting and Related Notions
Relative overtuning should be distinguished from related concepts:
- Meta-Overfitting: The discrepancy between the validation estimate and the true generalization error of the final HPC:
$\mathrm{of}_t = \mathrm{test}(\lambda^*_t) - \widehat{\mathrm{val}}(\lambda^*_t)$
Nonzero meta-overfitting is necessary but not sufficient for overtuning; overtuning specifically concerns missing a better HPC from prior search steps.
- Trajectory Test Regret: Compares the selection to the best true test performance among all already tried HPCs.
- Oracle Test Regret: The gap from a global optimum, often unknown in practice.
Relative overtuning is focused on the wasted opportunity caused by over-optimization on noisy validation scores—i.e., misranking due to selection, not bias in error estimation.
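The distinction can be made concrete with a small sketch that computes all three quantities for a single hypothetical trajectory, reusing the notation from the formal definition above; the error values are invented for illustration.

```python
import numpy as np

val = np.array([0.30, 0.25, 0.27, 0.20])   # validation estimates of lambda_1..lambda_4
test = np.array([0.31, 0.24, 0.22, 0.28])  # corresponding test errors

incumbents = [int(np.argmin(val[: k + 1])) for k in range(len(val))]
t = len(val) - 1            # final step
inc = incumbents[t]         # final incumbent lambda*_t

meta_overfitting = test[inc] - val[inc]              # of_t: estimate vs. truth for the incumbent
overtuning = test[inc] - test[incumbents].min()      # ot_t: vs. best *incumbent* so far
trajectory_regret = test[inc] - test[: t + 1].min()  # vs. best of *all* HPCs tried so far

# trajectory_regret >= overtuning here: the third HPC has the best test error
# but never became incumbent, so overtuning does not count it.
print(meta_overfitting, overtuning, trajectory_regret)
```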
6. Practical Implications and Recommendations
Relative overtuning has direct consequences for HPO reliability and interpretability:
- Overtuning is sufficiently common to warrant routine consideration, especially in AutoML and small-data contexts.
- Small datasets pose the greatest risk; cautious strategies should be prioritized.
- Reporting only validation/optimization performance is misleading: always report test set results post-HPO.
- Robust but more computationally demanding protocols (e.g., repeated cross-validation) trade marginal efficiency for reliability.
- Model class and metric selection affect susceptibility; log loss and less flexible models such as Elastic Net are less vulnerable, while highly flexible models and high-variance metrics present more risk.
- Transparency in HPO selection protocols is critical; users must understand how optimization and selection are being performed.
- No universal solution exists; best practices focus on reducing frequency and severity, not total elimination.
Recommendations: Employ robust evaluation, increase data size when feasible, monitor for overtuning, and treat validation scores as inherently noisy. Further work is required to develop scalable yet reliable overtuning mitigation.
In summary, relative overtuning operationalizes the notion of overfitting within HPO trajectories, providing a normalized, principled quantification of lost generalization improvement. Its prevalence, heterogeneity, and susceptibility to methodological choices necessitate robust mitigation and careful reporting to ensure trustworthy hyperparameter selection.