Learning Rate Transferability
- Learning rate transferability is the reuse of source learning rate policies to accelerate and enhance optimization on target tasks.
- Layer-wise adaptation adjusts individual learning rates using discrepancy metrics computed from quantities such as attention maps and Hessian spectra, boosting convergence.
- Meta-learned schedules, such as LSTM-based controllers, facilitate parameterized learning rate transfer across varied data modalities and architectures.
Learning rate transferability refers to the ability to leverage learning rate (LR) configurations—either as explicit schedules or as layer-wise assignments—obtained on a source domain or task, in order to accelerate and improve the optimization of a target task, particularly in transfer learning scenarios. This property is crucial for efficient adaptation across data modalities, architectures, dataset sizes, and varying task complexities, impacting both convergence speed and statistical performance when reusing pre-trained models or meta-learned optimizers.
1. Problem Formulation and Theoretical Foundations
Learning rate transferability emerges within supervised transfer learning, knowledge distillation, and meta-learning settings. The central question is: given an effective LR policy—whether a global decay, a meta-learned schedule, or a vector of per-layer rates—does reusing or adapting this policy on a different, but related, target task provide measurable gains in training efficiency or accuracy?
Formally, a transfer learning scenario consists of a teacher model with parameters $\theta_T$ and a student model with parameters $\theta_S$ (often $|\theta_S| \ll |\theta_T|$). Optimization follows a composite loss $\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{distill}}$, where $\mathcal{L}_{\text{task}}$ is the standard task loss (e.g., cross-entropy), and $\mathcal{L}_{\text{distill}}$ is a layer-wise distillation loss, possibly defined in terms of attention maps, Jacobians, or Hessians at each layer. The goal is to optimize each layer $l$ with a specific learning rate $\eta_l$, which may be derived from discrepancies between teacher and student representations (Kokane et al., 5 Jul 2024).
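As an illustration, the composite objective can be sketched in a few lines of NumPy. This is a minimal sketch, assuming an attention-map-based distillation term; the squared-activation pooling in `attention_map` and the equal weighting of layers are illustrative choices, not the paper's exact formulation:

```python
import numpy as np

def attention_map(features):
    """Collapse a (channels, H, W) feature tensor to a normalized (H, W)
    spatial attention map by summing squared activations over channels.
    (Illustrative pooling; other attention definitions are possible.)"""
    att = (features ** 2).sum(axis=0)
    return att / (att.sum() + 1e-12)

def composite_loss(task_loss, teacher_feats, student_feats, lam=0.5):
    """L = L_task + lam * sum_l d_l, where d_l is the mean squared
    difference between teacher and student attention maps at layer l.
    Returns the total loss and the per-layer discrepancies d_l."""
    d = [((attention_map(t) - attention_map(s)) ** 2).mean()
         for t, s in zip(teacher_feats, student_feats)]
    return task_loss + lam * sum(d), d
```

The per-layer discrepancies `d` are exactly the quantities that the layer-wise LR schemes of Section 2 consume.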
Theoretical analysis of convergence rates under transfer learning exposes phase transitions: if the amount or informativeness of source data exceeds a threshold, transfer can strictly accelerate convergence, with the minimax excess risk improving as a function of the source sample size $n_S$; otherwise, reliance on transfer may hurt or show no benefit (Reeve et al., 2021).
2. Layer-Wise Learning Rate Adaptation and Transfer
Data-driven or heuristically tuned layer-wise LRs can be critical for successful transfer learning. Instead of a single global LR, a vector $\boldsymbol{\eta} = (\eta_1, \ldots, \eta_L)$ is maintained, with each $\eta_l$ dynamically adjusted based on the per-layer discrepancy $d_l$.
One prominent approach computes, at each critical layer, a discrepancy metric (e.g., the Jensen–Shannon divergence between student and teacher attention maps, Jacobian vectors, or Hessian spectra), aggregates these over the dataset, and periodically applies a momentum-smoothed update to each $\eta_l$.
This decouples adaptation across layers and enables "warm-starting" of the entire set $\{\eta_l\}$ on new tasks sharing similar architectures and feature dimensions. In empirical studies, such checkpointing reduced convergence time by 20% and offered an immediate 1–2% improvement in student accuracy relative to naïve global schedules (Kokane et al., 5 Jul 2024).
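The momentum-smoothed, discrepancy-driven update described above can be sketched as follows. The JS-divergence metric follows the text; the specific scaling rule `base_lr * (1 + gain * d_avg)` and all constants are illustrative assumptions, not the published update:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two (flattened, normalized)
    attention maps."""
    p, q = p.ravel() + eps, q.ravel() + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

class LayerwiseLR:
    """Maintain per-layer LRs eta_l. Each update blends new per-layer
    discrepancies into a momentum-smoothed running average d_avg, then
    scales the base LR so that layers with larger teacher-student
    mismatch receive larger steps (a heuristic rule for illustration)."""
    def __init__(self, num_layers, base_lr=0.01, beta=0.9, gain=1.0):
        self.eta = np.full(num_layers, base_lr)
        self.d_avg = np.zeros(num_layers)
        self.base_lr, self.beta, self.gain = base_lr, beta, gain

    def update(self, discrepancies):
        # Momentum smoothing dampens noisy per-layer estimates.
        self.d_avg = self.beta * self.d_avg + (1 - self.beta) * np.asarray(discrepancies)
        self.eta = self.base_lr * (1.0 + self.gain * self.d_avg)
        return self.eta
```

The state `(eta, d_avg)` is exactly what the "warm-starting" checkpoint would carry over to a new task with the same layer structure.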
3. Meta-Learned and Parameterized Learning Rate Schedules
Meta-learned optimizers pursue transferability by parameterizing the LR schedule as a mapping from training statistics (e.g., the current training loss) to a step size, for instance using a compact LSTM network as in MLR-SNet. Meta-training proceeds via bilevel optimization: the inner loop applies SGD using the learning rates generated by the controller, while the outer loop updates controller parameters to minimize validation loss after the inner steps. Importantly, the learned controller can be re-used directly at meta-test time on new tasks (of differing data types, architectures, or sample sizes), yielding convergence and final accuracy comparable or superior to hand-tuned baselines, without further tuning (Shu et al., 2020). Theoretical guarantees are established: if the loss is smooth and satisfies the Polyak–Łojasiewicz condition, the convergence rate with MLR-SNet matches that of the best-known SGD algorithms up to logarithmic factors.
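The bilevel structure can be illustrated on a toy problem. Here a single scalar step-size parameter stands in for the LSTM controller, the inner task is a one-dimensional quadratic, and the outer update uses finite differences; all of these simplifications are illustrative assumptions, not MLR-SNet itself:

```python
def inner_sgd(theta0, lr_fn, steps=20):
    """Inner loop: SGD on f(theta) = 0.5 * theta^2, with the step size
    supplied by a controller lr_fn(loss). Returns the final loss,
    which plays the role of the validation loss."""
    theta = theta0
    for _ in range(steps):
        loss = 0.5 * theta ** 2
        theta -= lr_fn(loss) * theta  # gradient of f is theta
    return 0.5 * theta ** 2

def meta_train(w0=0.01, meta_lr=0.05, meta_steps=30, eps=1e-4):
    """Outer loop: adjust the controller parameter w (here, a constant
    step size) by finite-difference gradient descent on the
    post-inner-loop validation loss."""
    w = w0
    for _ in range(meta_steps):
        f = lambda w_: inner_sgd(1.0, lambda loss: w_, 20)
        grad = (f(w + eps) - f(w - eps)) / (2 * eps)
        w -= meta_lr * grad
        w = min(max(w, 1e-3), 1.9)  # clamp to keep inner SGD stable
    return w
```

Despite the drastic simplification, the sketch shows the key property the section describes: the outer loop discovers a step-size policy purely from post-training validation loss, and that policy transfers to any later run of the inner problem.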
4. Practical Layer-Wise Tuning Strategies and Empirics
Systematic experimental work demonstrates that tuning learning rates across network layers is a substantial driver of transfer learning success. For example, when fine-tuning ResNet-27 between diverse subsets of ImageNet22K and Oxford Flowers, varying only the last-layer LR (with inner layers frozen) produced up to 127.8% relative accuracy gains. Introducing graduated layer-wise multipliers (e.g., lower multipliers for conv1–conv5 and $16\times$ for the final FC layer) yielded consistently higher accuracy than uniform assignment, with modest additional tuning effort.
A critical empirical observation is the association between optimal inner-layer LR and the "images per label" statistic: denser target datasets admit higher inner-layer LRs. On 70 real-world image transfer tasks, a fixed graduated schedule outperformed baseline by 1.6 percentage points on average. These findings support the adoption of per-layer LR schedules as a robust default in transfer learning pipelines, with further performance available via lightweight per-task scaling sweeps (Dube et al., 2018).
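A graduated schedule of this kind reduces to a few lines of code. The helper below is a minimal sketch; the specific multiplier vector and the `knee` threshold in the images-per-label heuristic are illustrative assumptions, not values from the cited study:

```python
def layerwise_lrs(base_lr, multipliers):
    """Per-layer LRs from a base LR and graduated multipliers:
    earlier layers get smaller multipliers, the head the largest."""
    return [base_lr * m for m in multipliers]

def inner_lr_scale(images_per_label, lo=0.1, hi=1.0, knee=100):
    """Illustrative heuristic for the density association noted above:
    denser target datasets (more images per label) admit higher
    inner-layer LRs, saturating past the knee."""
    return lo + (hi - lo) * min(images_per_label / knee, 1.0)
```

A typical usage would scale the conv-layer multipliers by `inner_lr_scale(...)` for the target dataset, then run the lightweight per-task sweep the text recommends.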
5. Quantitative Impact of Learning Rate Transferability
Direct empirical comparisons isolate the effect of learning rate transfer techniques. In multi-dataset transfer (e.g., CIFAR-10, CIFAR-100, COCO):
- All distillation methods (attention, Jacobian, Hessian) benefit from layer-wise LR updates, more so as task difficulty increases.
- On CIFAR-100, Jacobian-based distillation achieved higher student accuracy with layer-wise LRs than with either a fixed global LR or no LR scheduling at all.
- On COCO, moving from no LR scheduling to layer-wise LRs improved accuracy by $2$–$3$ points for all distillation modalities (Kokane et al., 5 Jul 2024).
MLR-SNet, evaluated across image and language domains, demonstrated consistent transferability: LSTM-based LR controllers trained on CIFAR-10 performed competitively, without re-tuning, on SVHN, TinyImageNet, and Penn Treebank, as well as on unseen network architectures and even corrupted data conditions (Shu et al., 2020).
6. Conditions and Limits of Learning Rate Transferability
Successful LR transfer requires structural or statistical alignment between source and target tasks:
- Similarity of feature-space dimensionality and distribution, especially for per-layer scheduling (Kokane et al., 5 Jul 2024).
- Sufficient momentum in the running averages to dampen noisy per-layer discrepancy estimates.
- Regular, but not overly frequent, LR updates to avoid oscillation or instability.
- In meta-learning, the parameterized LR schedule must generalize across the range of encountered training losses and dynamic regimes (Shu et al., 2020).
When these conditions are not met—e.g., if the target task diverges vastly in data statistics, complexity, or model structure—LR transfer may deliver no benefit or even degrade performance (cf. regime analysis of (Reeve et al., 2021)). A practical guideline is to automatically calibrate LR schedules in the target setting, utilizing source policies as strong initializations.
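The calibration guideline above (warm-start from the source policy, then tune cheaply on the target) can be sketched as a short global-scale sweep. The `eval_fn` interface and the scale grid are illustrative assumptions:

```python
def calibrate(source_lrs, eval_fn, scales=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Lightweight target-side calibration: sweep a global scale over
    the source-derived per-layer LRs and keep the scale with the best
    short-run validation score. eval_fn(lrs) should run a brief
    training probe and return a validation loss (lower is better)."""
    best_scale = min(scales,
                     key=lambda s: eval_fn([lr * s for lr in source_lrs]))
    return [lr * best_scale for lr in source_lrs]
```

The relative structure of the source policy (which layers get more or less LR) is preserved; only an overall magnitude is re-fit, which is cheap even when the target task's statistics differ.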
7. Synthesis and Perspectives
Learning rate transferability integrates algorithmic advances in per-layer adaptation, meta-learned scheduling, and empirical heuristics for hyperparameter selection. The approach is scalable across architectures and tasks, with both theoretically grounded and experimentally validated benefits. Key research, including layer-wise adaptive distillation (Kokane et al., 5 Jul 2024), LSTM-based meta-schedulers (Shu et al., 2020), and structured empirical investigations (Dube et al., 2018), establishes that rational transfer and tuning of learning rates is a principal driver in realizing the full potential of transfer learning and knowledge distillation.
A plausible implication is that continued progress in this domain will require more expressive, yet robust, parameterizations of LR policies, coupled with careful regularization or calibration mechanisms to maintain transferability as task heterogeneity increases.