Scale-Invariant Neural Network Training
- Scale-invariant neural network training is defined by methods that keep network outputs unchanged when inputs or parameters are positively rescaled, ensuring robustness to magnitude variations.
- Techniques such as orbit-canonization, scale-equivariant transforms, and multi-column architectures enable consistent performance, as demonstrated in tasks like point cloud classification and digit recognition.
- Advanced optimizers and regularizers, including scale-invariant SGD and ISS, leverage geometric insights and thermodynamic analogies to improve convergence and generalization in multiscale environments.
Scale-invariant neural network training encompasses methods and architectures designed so that the model output or the learned function remains unchanged under positive rescalings of the input, parameters, or inner representations: precisely, f(αz) = f(z) for all α > 0, where z denotes the rescaled quantity. This property is fundamental for robustness against scale variations in the input data, internal activations, or network parameters, and for learning features independent of overall magnitude. Scale invariance is relevant to architectural design, optimization, regularization, and empirical generalization, and manifests in a variety of techniques ranging from orbit-mapping layers and equivariant transforms to special regularizers and optimization approaches.
1. Mathematical Foundations of Scale Invariance
Scale invariance is formalized as 0-homogeneity: a function f is scale-invariant if f(αθ) = f(θ) for all α > 0 and all θ (Li et al., 2022). For neural network layers and losses exhibiting this property, optimization dynamics, stationary distributions, and convergence proofs rely on the geometry of the sphere or on invariant statistics.
Formal group action: For instance, in point cloud classification (Gandikota et al., 2021), the scaling group ℝ₊ = (0, ∞) acts as s · X = sX on a point cloud X, and f is ℝ₊-invariant if f(sX) = f(X) for all s > 0.
Consequences: Scale-invariant losses induce gradients orthogonal to the parameter vector (⟨∇_θ L(θ), θ⟩ = 0 via Euler's theorem), and optimization often reduces to the sphere (Kodryan et al., 2022). In the context of SGD plus weight decay, the stationary distribution of parameters is described by measures on the sphere, and admits a full thermodynamic analogy (Sadrtdinov et al., 10 Nov 2025).
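A minimal numerical check of this orthogonality, assuming PyTorch and a purely illustrative loss that depends on the weights only through their direction:

import torch

# 0-homogeneous loss: w enters only via the normalized direction w / ||w||,
# so rescaling w by any α > 0 leaves the loss unchanged.
def scale_invariant_loss(w, x, y):
    logits = x @ (w / w.norm())
    return torch.nn.functional.cross_entropy(logits.unsqueeze(0), y)

torch.manual_seed(0)
w = torch.randn(8, 3, requires_grad=True)
x = torch.randn(8)
y = torch.tensor([1])

loss = scale_invariant_loss(w, x, y)
grad, = torch.autograd.grad(loss, w)

# Euler's theorem for 0-homogeneous functions: <∇L(w), w> = 0 (up to float error).
print(torch.dot(grad.flatten(), w.flatten()))
# Invariance under positive rescaling of the parameters.
print(scale_invariant_loss(3.7 * w, x, y) - loss)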
2. Scale-Invariant Architectures
Orbit-mapping (Canonization) Layers
A direct approach is orbit canonization (Gandikota et al., 2021): pre-processing the input by mapping each orbit (under the scale action) to a canonical representative, e.g., dividing a point cloud X by its mean radius r = Φ(X), so that X̂ = X/r satisfies Φ(X̂) = 1. The resulting composition f(X̂; θ) = f(X/Φ(X); θ) is provably scale-invariant:
Text-diagram:
X ──▶ [Compute r = Φ(X)] ──▶ [Scale by s = 1/r] ──▶ X̂ ──▶ f(X̂; θ) ──▶ y
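A minimal sketch of this canonization step for point clouds, assuming PyTorch and taking Φ to be the mean point norm as described above (the eps guard is an illustrative safeguard for degenerate inputs):

import torch

def canonize_scale(X: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # X: point cloud of shape (N, d); map it to the canonical representative of its scale orbit.
    r = X.norm(dim=1).mean()      # Φ(X): mean radius of the points
    return X / (r + eps)          # X̂ = X / Φ(X), so Φ(X̂) ≈ 1

# Any downstream network f(X̂; θ) is then scale-invariant by construction:
X = torch.randn(1024, 3)
assert torch.allclose(canonize_scale(X), canonize_scale(5.0 * X), atol=1e-5)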
Scale-Equivariant Transforms
Operators such as Riesz transforms (Barisin et al., 2023) and log-radial harmonics (Ghosh et al., 2019) guarantee scale equivariance or invariance at the architectural level. Riesz networks replace convolutions with learned linear combinations of Riesz-kernels, resulting in exact scale equivariance in all layers. Log-radial harmonic filter banks enable steerable scaling of CNN filters, so that features are preserved under rescaling. Empirical results show strong robustness to scale variation in crack detection and digit classification.
Multi-Column and Pyramid Architectures
SiCNN (Xu et al., 2014) and locally scale-invariant CNNs (Kanazawa et al., 2014) construct networks with multiple columns or layers, each operating at a different scale but sharing parameters via fixed, linearly transformed filter sets. Max-pooling across scaled responses achieves local scale invariance without increasing the parameter count. These architectures systematically outperform single-scale CNNs when scale variation is pronounced.
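A minimal sketch of locally scale-invariant feature extraction in this spirit, assuming PyTorch; the scale set and bilinear resampling are illustrative stand-ins for the fixed filter transforms used in the cited architectures:

import torch
import torch.nn.functional as F

class ScalePooledConv(torch.nn.Module):
    # One shared convolution applied at several input scales, max-pooled over the scale axis.
    def __init__(self, in_ch, out_ch, scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.scales = scales

    def forward(self, x):
        h, w = x.shape[-2:]
        responses = []
        for s in self.scales:
            xs = F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            ys = self.conv(xs)   # parameters shared across all scales
            ys = F.interpolate(ys, size=(h, w), mode='bilinear', align_corners=False)
            responses.append(ys)
        return torch.stack(responses, dim=0).max(dim=0).values   # max over scales

feats = ScalePooledConv(1, 16)(torch.randn(4, 1, 28, 28))   # shape (4, 16, 28, 28)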
Feature-Transform and Whitening
Layer-wise feature transforms imposing per-sample normalization and batch covariance whitening achieve both scale-invariance and basis-invariance (Ye et al., 2021). The forward transform removes scale and mean from each sample, while a global inverse-covariance ("whitening") transform produces decorrelated outputs. This isotropizes the local Hessian, accelerating convergence and making training invariant to input scale and basis.
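A minimal sketch of such a transform, assuming PyTorch and a ZCA-style batch whitening with an illustrative ridge term as a stand-in for the exact procedure of Ye et al. (2021):

import torch

def scale_basis_invariant_transform(X: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # X: (batch, features). Remove per-sample mean and scale, then whiten across the batch.
    X = X - X.mean(dim=1, keepdim=True)
    X = X / (X.norm(dim=1, keepdim=True) + eps)
    cov = X.T @ X / X.shape[0] + eps * torch.eye(X.shape[1])
    evals, evecs = torch.linalg.eigh(cov)
    whiten = evecs @ torch.diag(evals.rsqrt()) @ evecs.T   # ZCA whitening matrix
    return X @ whiten

X = torch.randn(256, 32)
print(torch.allclose(scale_basis_invariant_transform(X),
                     scale_basis_invariant_transform(5.0 * X), atol=1e-4))   # per-sample scale removed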
3. Scale-Invariant Optimization Methods
Scale-Invariant SGD and MultiAdam
Standard SGD with weight decay suffers from sensitivity to the scale of initialization and loss unless the architecture is scale-invariant (Li et al., 2022). For scale-invariant networks, the parameter norm equilibrates naturally, enabling robust training. The MultiAdam optimizer (Yao et al., 2023) splits the objective into G groups (e.g., PDE residuals and boundary losses in PINNs) and maintains per-group first- and second-moment statistics, adaptively balancing loss gradients under domain rescaling:
Pseudocode:
for t in 1..T:
    for i in 1..G:
        g_{t,i} = ∇_θ f_i(θ_{t−1})
        m_{t,i} = β₁ m_{t−1,i} + (1 − β₁) g_{t,i}
        v_{t,i} = β₂ v_{t−1,i} + (1 − β₂) g_{t,i}²
        m̂_{t,i} = m_{t,i} / (1 − β₁^t)
        v̂_{t,i} = v_{t,i} / (1 − β₂^t)
    θ_t ← θ_{t−1} − (γ/G) Σ_{i=1}^{G} m̂_{t,i} / √(v̂_{t,i} + ε)
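A minimal runnable sketch of this per-group update, assuming PyTorch autograd and a single flat parameter vector; it is a simplified illustration of the update above, not the reference MultiAdam implementation:

import torch

def multiadam_step(theta, loss_fns, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # loss_fns: list of G group losses f_i(θ); state holds per-group moments and the step counter.
    G, (b1, b2) = len(loss_fns), betas
    state['t'] += 1
    t = state['t']
    update = torch.zeros_like(theta)
    for i, f in enumerate(loss_fns):
        g, = torch.autograd.grad(f(theta), theta)              # g_{t,i}
        state['m'][i] = b1 * state['m'][i] + (1 - b1) * g      # first moment
        state['v'][i] = b2 * state['v'][i] + (1 - b2) * g**2   # second moment
        m_hat = state['m'][i] / (1 - b1**t)                    # bias correction
        v_hat = state['v'][i] / (1 - b2**t)
        update += m_hat / torch.sqrt(v_hat + eps)
    with torch.no_grad():
        theta -= (lr / G) * update
    return theta

theta = torch.randn(10, requires_grad=True)
groups = [lambda th: (th ** 2).sum(), lambda th: (th - 1.0).abs().sum()]   # two illustrative loss groups
state = {'t': 0, 'm': [torch.zeros(10) for _ in groups], 'v': [torch.zeros(10) for _ in groups]}
for _ in range(100):
    theta = multiadam_step(theta, groups, state)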
Thermodynamic Perspective
SGD dynamics with weight decay on scale-invariant networks correspond exactly to an ideal gas process (Sadrtdinov et al., 10 Nov 2025). The training hyperparameters map onto thermodynamic quantities: the learning rate to temperature T, the weight decay coefficient to pressure P, and the parameter norm to volume V. The stationary entropy and parameter radius follow ideal gas predictions, providing a principled foundation for hyperparameter scheduling and generalization analysis.
4. Scale-Invariant Regularization and Sparsification
Weight Scale Shifting Invariant (ISS) Regularizers
ISS regularizers are invariant to layer-wise weight-scale shifts, exploiting the positive homogeneity of the network (Liu et al., 2020). The regularization term penalizes the product of layer norms together with a normalized term, constraining the intrinsic norm of the network rather than any particular layer-wise rescaling. This formulation upper-bounds the input gradient norm and enhances adversarial robustness and generalization compared to conventional weight decay.
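As an illustration of the invariance (not the exact ISS regularizer of Liu et al., 2020), a penalty built from the product of per-layer norms is unchanged when one layer is scaled by c and the next by 1/c, whereas standard weight decay is not; a minimal sketch assuming PyTorch and two fully connected layers:

import torch

def product_of_norms_penalty(weights):
    # Illustrative weight-scale-shift-invariant penalty: product of per-layer Frobenius norms.
    penalty = torch.ones(())
    for W in weights:
        penalty = penalty * W.norm()
    return penalty

W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
c = 7.0   # (c·W1, W2/c) represents the same ReLU network function as (W1, W2)
print(product_of_norms_penalty([W1, W2]), product_of_norms_penalty([c * W1, W2 / c]))   # equal
print(W1.norm()**2 + W2.norm()**2, (c * W1).norm()**2 + (W2 / c).norm()**2)             # weight decay shifts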
Scale-Invariant Sparsity Penalties
DeepHoyer (1908.09979) introduces differentiable, scale-invariant sparsity measures based on the Hoyer ratio ‖w‖₁/‖w‖₂, such as the Hoyer-Square and Group-Hoyer penalties. These regularizers induce sparsity proportionally across elements/groups, outperforming ℓ₁- and ADMM-based approaches for both element-wise and structural pruning while maintaining scale invariance and differentiability.
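A minimal sketch of the Hoyer-Square penalty on a single weight tensor, assuming PyTorch (the eps term is an illustrative numerical guard):

import torch

def hoyer_square(w: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Hoyer-Square: (||w||_1)^2 / ||w||_2^2, differentiable and invariant to rescaling of w.
    return w.abs().sum() ** 2 / (w.pow(2).sum() + eps)

w = torch.randn(256)
print(hoyer_square(w), hoyer_square(10.0 * w))   # equal values: the penalty is scale-invariant
# e.g., added to the task loss as: loss = task_loss + lam * sum(hoyer_square(p) for p in model.parameters())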
5. Empirical Performance, Limitations, and Guidelines
Empirical robustness: Canonical orbit mapping (Gandikota et al., 2021), SiCNN (Xu et al., 2014), Riesz networks (Barisin et al., 2023), and log-radial harmonics (Ghosh et al., 2019) achieve high accuracy and robustness in classification tasks on test sets augmented with severe scale transformations. ISS regularization (Liu et al., 2020) and DeepHoyer (1908.09979) yield improved generalization and sparsity, as well as increased adversarial robustness.
Guidelines for practitioners: Effective scale-invariant training requires
- canonicalization of inputs or activations where possible,
- strict architectural enforcement of scale equivariance/invariance,
- use of optimizers sensitive to group-wise scale (MultiAdam) or global norms,
- careful regularization imposing true invariance to scale shifts,
- explicit monitoring of the effective learning rate on the sphere for normalization-based architectures (Kodryan et al., 2022; see the sketch after this list),
- tuning step sizes according to the effective temperature (thermodynamic mapping) (Sadrtdinov et al., 10 Nov 2025).
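For the effective-learning-rate guideline above, a minimal monitoring sketch assuming PyTorch and the common convention η_eff = η/‖θ‖² for scale-invariant parameter groups (the group selection here is illustrative):

import torch

def effective_lrs(model: torch.nn.Module, lr: float) -> dict:
    # Report η / ||θ||² per weight tensor; meaningful for scale-invariant groups,
    # e.g., convolution weights immediately followed by batch normalization.
    return {name: lr / p.detach().norm().pow(2).item()
            for name, p in model.named_parameters() if p.dim() > 1}

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.BatchNorm2d(16))
print(effective_lrs(model, lr=0.1))   # log this periodically during training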
Limitations: Scale-invariant designs may be sensitive to features with zero or negative values (log-based branches), may incur increased inference cost (multi-column or steerable-filter variants), or may require a priori identification of which features/parameters should be invariant. Not all forms of invariance extend to categorical or sign-flipping perturbations (Petrozziello et al., 2 Oct 2024).
6. Connections to Broader Invariance Learning
Scale invariance is one instance of learning or incorporating group invariance in neural networks (Gandikota et al., 2021). Similar frameworks, such as rotation, reflection, or affine invariance, can be constructed via analogous orbit-mapping, equivariant transforms, or parameter tying. Hybrid approaches (e.g., scale-invariant learning-to-rank (Petrozziello et al., 2 Oct 2024)) split features into trusted and sensitive branches, leveraging invariance for robustness against mismatch between training and inference scales. Multi-scale strategies (Noord et al., 2016), which combine variant and invariant features, further improve generalization in practice.
7. Outlook and Current Research Directions
Recent work seeks to generalize scale-invariant principles to more complex architectural patterns, group actions, and training objectives, including equivariant normalization, unsupervised representation learning, multi-task settings, generative models, and physics-informed neural networks. Thermodynamic analogies (Sadrtdinov et al., 10 Nov 2025) offer new perspectives on hyperparameter tuning and ensemble strategies. Ongoing research addresses computational trade-offs, extension to high-dimensional scientific data, and unifying frameworks for invariance under arbitrary continuous groups.