Scale-Invariant Neural Network Training
- Scale-invariant neural network training is defined by methods that keep network outputs unchanged when inputs or parameters are positively rescaled, ensuring robustness to magnitude variations.
- Techniques such as orbit-canonization, scale-equivariant transforms, and multi-column architectures enable consistent performance, as demonstrated in tasks like point cloud classification and digit recognition.
- Advanced optimizers and regularizers, including scale-invariant SGD and ISS, leverage geometric insights and thermodynamic analogies to improve convergence and generalization in multiscale environments.
Scale-invariant neural network training encompasses methods and architectures designed so that the model output or the learned function remains unchanged under positive rescalings of the input, parameters, or inner representations: precisely, f(αz) = f(z) for all α > 0, where z denotes the rescaled quantity. This property is fundamental for robustness against scale variations in the input data, internal activations, or network parameters, and for learning features independent of overall magnitude. Scale invariance is relevant to architectural design, optimization, regularization, and empirical generalization, and manifests in a variety of techniques ranging from orbit-mapping layers and equivariant transforms to special regularizers and optimization approaches.
1. Mathematical Foundations of Scale Invariance
Scale invariance is formalized as 0-homogeneity: a function f is scale-invariant if f(αθ) = f(θ) for all α > 0 and all θ (Li et al., 2022). For neural network layers and losses exhibiting this property, optimization dynamics, stationary distributions, and convergence proofs rely on the geometry of the sphere or on invariant statistics.
Formal group action: For instance, in point cloud classification (Gandikota et al., 2021), the scaling group ℝ₊ = (0, ∞) acts as s · X = sX on a point cloud X, and f is ℝ₊-invariant if f(sX) = f(X) for all s > 0.
Consequences: Scale-invariant losses induce gradients orthogonal to the parameter vector (⟨∇_θ L(θ), θ⟩ = 0 via Euler's theorem), and optimization often reduces to the sphere (Kodryan et al., 2022). In the context of SGD plus weight decay, the stationary distribution of parameters is described by measures on the sphere, and admits a full thermodynamic analogy (Sadrtdinov et al., 10 Nov 2025).
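A minimal numerical check of this orthogonality, assuming PyTorch and a purely illustrative loss that depends on the weights only through their direction:

import torch

# 0-homogeneous loss: w enters only via the normalized direction w / ||w||,
# so rescaling w by any α > 0 leaves the loss unchanged.
def scale_invariant_loss(w, x, y):
    logits = x @ (w / w.norm())
    return torch.nn.functional.cross_entropy(logits.unsqueeze(0), y)

torch.manual_seed(0)
w = torch.randn(8, 3, requires_grad=True)
x = torch.randn(8)
y = torch.tensor([1])

loss = scale_invariant_loss(w, x, y)
grad, = torch.autograd.grad(loss, w)

# Euler's theorem for 0-homogeneous functions: <∇L(w), w> = 0 (up to float error).
print(torch.dot(grad.flatten(), w.flatten()))
# Invariance under positive rescaling of the parameters.
print(scale_invariant_loss(3.7 * w, x, y) - loss)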
2. Scale-Invariant Architectures
Orbit-mapping (Canonization) Layers
A direct approach is orbit canonization (Gandikota et al., 2021): pre-processing the input by mapping each orbit (under the scale action) to a canonical representative, e.g., dividing a point cloud X by its mean radius r = Φ(X), so that X̂ = X/r satisfies Φ(X̂) = 1. The resulting composition f(X̂; θ) = f(X/Φ(X); θ) is provably scale-invariant:
Text-diagram:
X ──▶ [Compute r = Φ(X)] ──▶ [Scale by s = 1/r] ──▶ X̂ ──▶ f(X̂; θ) ──▶ y
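A minimal sketch of this canonization step for point clouds, assuming PyTorch and taking Φ to be the mean point norm as described above (the eps guard is an illustrative safeguard for degenerate inputs):

import torch

def canonize_scale(X: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # X: point cloud of shape (N, d); map it to the canonical representative of its scale orbit.
    r = X.norm(dim=1).mean()      # Φ(X): mean radius of the points
    return X / (r + eps)          # X̂ = X / Φ(X), so Φ(X̂) ≈ 1

# Any downstream network f(X̂; θ) is then scale-invariant by construction:
X = torch.randn(1024, 3)
assert torch.allclose(canonize_scale(X), canonize_scale(5.0 * X), atol=1e-5)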
Scale-Equivariant Transforms
Operators such as Riesz transforms (Barisin et al., 2023) and log-radial harmonics (Ghosh et al., 2019) guarantee scale equivariance or invariance at the architectural level. Riesz networks replace convolutions with learned linear combinations of Riesz-kernels, resulting in exact scale equivariance in all layers. Log-radial harmonic filter banks enable steerable scaling of CNN filters, so that features are preserved under rescaling. Empirical results show strong robustness to scale variation in crack detection and digit classification.
Multi-Column and Pyramid Architectures
SiCNN (Xu et al., 2014) and locally scale-invariant CNNs (Kanazawa et al., 2014) construct networks with multiple columns or layers, each operating at a different scale but sharing parameters via fixed, linearly transformed filter sets. Max-pooling across scaled responses achieves local scale invariance without increasing the parameter count. These architectures systematically outperform single-scale CNNs when scale variation is pronounced.
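A minimal sketch of locally scale-invariant feature extraction in this spirit, assuming PyTorch; the scale set and bilinear resampling are illustrative stand-ins for the fixed filter transforms used in the cited architectures:

import torch
import torch.nn.functional as F

class ScalePooledConv(torch.nn.Module):
    # One shared convolution applied at several input scales, max-pooled over the scale axis.
    def __init__(self, in_ch, out_ch, scales=(0.5, 1.0, 2.0)):
        super().__init__()
        self.conv = torch.nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.scales = scales

    def forward(self, x):
        h, w = x.shape[-2:]
        responses = []
        for s in self.scales:
            xs = F.interpolate(x, scale_factor=s, mode='bilinear', align_corners=False)
            ys = self.conv(xs)   # parameters shared across all scales
            ys = F.interpolate(ys, size=(h, w), mode='bilinear', align_corners=False)
            responses.append(ys)
        return torch.stack(responses, dim=0).max(dim=0).values   # max over scales

feats = ScalePooledConv(1, 16)(torch.randn(4, 1, 28, 28))   # shape (4, 16, 28, 28)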
Feature-Transform and Whitening
Layer-wise feature transforms imposing per-sample normalization and batch covariance whitening achieve both scale-invariance and basis-invariance (Ye et al., 2021). The forward transform removes scale and mean from each sample, while a global inverse-covariance ("whitening") transform produces decorrelated outputs. This isotropizes the local Hessian, accelerating convergence and making training invariant to input scale and basis.
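A minimal sketch of such a transform, assuming PyTorch and a ZCA-style batch whitening with an illustrative ridge term as a stand-in for the exact procedure of Ye et al. (2021):

import torch

def scale_basis_invariant_transform(X: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    # X: (batch, features). Remove per-sample mean and scale, then whiten across the batch.
    X = X - X.mean(dim=1, keepdim=True)
    X = X / (X.norm(dim=1, keepdim=True) + eps)
    cov = X.T @ X / X.shape[0] + eps * torch.eye(X.shape[1])
    evals, evecs = torch.linalg.eigh(cov)
    whiten = evecs @ torch.diag(evals.rsqrt()) @ evecs.T   # ZCA whitening matrix
    return X @ whiten

X = torch.randn(256, 32)
print(torch.allclose(scale_basis_invariant_transform(X),
                     scale_basis_invariant_transform(5.0 * X), atol=1e-4))   # per-sample scale removed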
3. Scale-Invariant Optimization Methods
Scale-Invariant SGD and MultiAdam
Standard SGD with weight decay suffers from sensitivity to the scale of initialization and loss unless the architecture is scale-invariant (Li et al., 2022). For scale-invariant networks, the parameter norm equilibrates naturally, enabling robust training. The MultiAdam optimizer (Yao et al., 2023) splits the objective into G groups (e.g., PDE residuals and boundary losses in PINNs) and maintains per-group first- and second-moment statistics, adaptively balancing loss gradients under domain rescaling:
Pseudocode:
for t in 1..T:
    for i in 1..G:
        g_{t,i} = ∇_θ f_i(θ_{t−1})
        m_{t,i} = β₁ m_{t−1,i} + (1 − β₁) g_{t,i}
        v_{t,i} = β₂ v_{t−1,i} + (1 − β₂) g_{t,i}²
        m̂_{t,i} = m_{t,i} / (1 − β₁^t)
        v̂_{t,i} = v_{t,i} / (1 − β₂^t)
    θ_t ← θ_{t−1} − (γ/G) Σ_{i=1}^{G} m̂_{t,i} / √(v̂_{t,i} + ε)
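A minimal runnable sketch of this per-group update, assuming PyTorch autograd and a single flat parameter vector; it is a simplified illustration of the update above, not the reference MultiAdam implementation:

import torch

def multiadam_step(theta, loss_fns, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    # loss_fns: list of G group losses f_i(θ); state holds per-group moments and the step counter.
    G, (b1, b2) = len(loss_fns), betas
    state['t'] += 1
    t = state['t']
    update = torch.zeros_like(theta)
    for i, f in enumerate(loss_fns):
        g, = torch.autograd.grad(f(theta), theta)              # g_{t,i}
        state['m'][i] = b1 * state['m'][i] + (1 - b1) * g      # first moment
        state['v'][i] = b2 * state['v'][i] + (1 - b2) * g**2   # second moment
        m_hat = state['m'][i] / (1 - b1**t)                    # bias correction
        v_hat = state['v'][i] / (1 - b2**t)
        update += m_hat / torch.sqrt(v_hat + eps)
    with torch.no_grad():
        theta -= (lr / G) * update
    return theta

theta = torch.randn(10, requires_grad=True)
groups = [lambda th: (th ** 2).sum(), lambda th: (th - 1.0).abs().sum()]   # two illustrative loss groups
state = {'t': 0, 'm': [torch.zeros(10) for _ in groups], 'v': [torch.zeros(10) for _ in groups]}
for _ in range(100):
    theta = multiadam_step(theta, groups, state)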
Thermodynamic Perspective
SGD dynamics with weight decay on scale-invariant networks correspond exactly to an ideal gas process (Sadrtdinov et al., 10 Nov 2025). The training hyperparameters map onto thermodynamic quantities: the learning rate to temperature T, the weight decay coefficient to pressure P, and the parameter norm to volume V. The stationary entropy and parameter radius follow ideal gas predictions, providing a principled foundation for hyperparameter scheduling and generalization analysis.
4. Scale-Invariant Regularization and Sparsification
Weight Scale Shifting Invariant (ISS) Regularizers
ISS regularizers are invariant to layer-wise weight-scale shifts, exploiting the positive homogeneity of the network (Liu et al., 2020). The regularization term penalizes the product of layer norms together with a normalized term, constraining the intrinsic norm of the network rather than any particular layer-wise rescaling. This formulation upper-bounds the input gradient norm and enhances adversarial robustness and generalization compared to conventional weight decay.
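As an illustration of the invariance (not the exact ISS regularizer of Liu et al., 2020), a penalty built from the product of per-layer norms is unchanged when one layer is scaled by c and the next by 1/c, whereas standard weight decay is not; a minimal sketch assuming PyTorch and two fully connected layers:

import torch

def product_of_norms_penalty(weights):
    # Illustrative weight-scale-shift-invariant penalty: product of per-layer Frobenius norms.
    penalty = torch.ones(())
    for W in weights:
        penalty = penalty * W.norm()
    return penalty

W1, W2 = torch.randn(64, 32), torch.randn(10, 64)
c = 7.0   # (c·W1, W2/c) represents the same ReLU network function as (W1, W2)
print(product_of_norms_penalty([W1, W2]), product_of_norms_penalty([c * W1, W2 / c]))   # equal
print(W1.norm()**2 + W2.norm()**2, (c * W1).norm()**2 + (W2 / c).norm()**2)             # weight decay shifts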
Scale-Invariant Sparsity Penalties
DeepHoyer (1908.09979) introduces differentiable, scale-invariant sparsity measures based on the Hoyer ratio ‖w‖₁/‖w‖₂, such as the Hoyer-Square and Group-Hoyer penalties. These regularizers induce sparsity proportionally across elements/groups, outperforming ℓ₁- and ADMM-based approaches for both element-wise and structural pruning while maintaining scale invariance and differentiability.
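A minimal sketch of the Hoyer-Square penalty on a single weight tensor, assuming PyTorch (the eps term is an illustrative numerical guard):

import torch

def hoyer_square(w: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    # Hoyer-Square: (||w||_1)^2 / ||w||_2^2, differentiable and invariant to rescaling of w.
    return w.abs().sum() ** 2 / (w.pow(2).sum() + eps)

w = torch.randn(256)
print(hoyer_square(w), hoyer_square(10.0 * w))   # equal values: the penalty is scale-invariant
# e.g., added to the task loss as: loss = task_loss + lam * sum(hoyer_square(p) for p in model.parameters())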
5. Empirical Performance, Limitations, and Guidelines
Empirical robustness: Canonical orbit mapping (Gandikota et al., 2021), SiCNN (Xu et al., 2014), Riesz networks (Barisin et al., 2023), and log-radial harmonics (Ghosh et al., 2019) achieve high accuracy and robustness in classification tasks on test sets augmented with severe scale transformations. ISS regularization (Liu et al., 2020) and DeepHoyer (1908.09979) yield improved generalization and sparsity, as well as increased adversarial robustness.
Guidelines for practitioners: Effective scale-invariant training requires
- canonicalization of inputs or activations where possible,
- strict architectural enforcement of scale equivariance/invariance,
- use of optimizers sensitive to group-wise scale (MultiAdam) or global norms,
- careful regularization imposing true invariance to scale shifts,
- explicit monitoring of the effective learning rate on the sphere for normalization-based architectures (Kodryan et al., 2022; see the sketch after this list),
- tuning step sizes according to the effective temperature (thermodynamic mapping) (Sadrtdinov et al., 10 Nov 2025).
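For the effective-learning-rate guideline above, a minimal monitoring sketch assuming PyTorch and the common convention η_eff = η/‖θ‖² for scale-invariant parameter groups (the group selection here is illustrative):

import torch

def effective_lrs(model: torch.nn.Module, lr: float) -> dict:
    # Report η / ||θ||² per weight tensor; meaningful for scale-invariant groups,
    # e.g., convolution weights immediately followed by batch normalization.
    return {name: lr / p.detach().norm().pow(2).item()
            for name, p in model.named_parameters() if p.dim() > 1}

model = torch.nn.Sequential(torch.nn.Conv2d(3, 16, 3), torch.nn.BatchNorm2d(16))
print(effective_lrs(model, lr=0.1))   # log this periodically during training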
Limitations: Scale-invariant designs may be sensitive to features with zero or negative values (log-based branches), may incur increased inference cost (multi-column or steerable-filter variants), or may require a priori identification of which features/parameters should be invariant. Not all forms of invariance extend to categorical or sign-flipping perturbations (Petrozziello et al., 2 Oct 2024).
6. Connections to Broader Invariance Learning
Scale invariance is one instance of learning or incorporating group invariance in neural networks (Gandikota et al., 2021). Similar frameworks, such as rotation, reflection, or affine invariance, can be constructed via analogous orbit-mapping, equivariant transforms, or parameter tying. Hybrid approaches (e.g., scale-invariant learning-to-rank (Petrozziello et al., 2 Oct 2024)) split features into trusted and sensitive branches, leveraging invariance for robustness against mismatch between training and inference scales. Multi-scale strategies (Noord et al., 2016), which combine variant and invariant features, further improve generalization in practice.
7. Outlook and Current Research Directions
Recent work seeks to generalize scale-invariant principles to more complex architectural patterns, group actions, and training objectives, including equivariant normalization, unsupervised representation learning, multi-task settings, generative models, and physics-informed neural networks. Thermodynamic analogies (Sadrtdinov et al., 10 Nov 2025) offer new perspectives on hyperparameter tuning and ensemble strategies. Ongoing research addresses computational trade-offs, extension to high-dimensional scientific data, and unifying frameworks for invariance under arbitrary continuous groups.