Scale-Depth Asymmetric Dependency
- Scale-Depth Asymmetric Dependency is a unidirectional relationship where scale influences depth-based mechanisms, critical in quantile estimation, depth inference, and neural architectures.
- It reveals that changes in scale dictate resource allocation, accuracy, and representational capacity while increasing depth alone cannot reverse scale-induced effects.
- Empirical studies show that using asymmetric scale functions leads to efficiency gains and optimized phase transitions in models, underscoring the importance of decoupling scale and depth.
Scale-Depth Asymmetric Dependency refers to a class of phenomena in machine learning and statistical modeling in which the relationship between “scale” (either as a hyperparameter, data property, or representational granularity) and “depth” (model layers, algorithmic steps, or feature hierarchy) is fundamentally unidirectional or asymmetric. Scale influences depth-based mechanisms or outcomes, but there is no symmetric or reciprocal relationship from depth to scale, or the mapping is highly non-linear and not invertible. This concept is reflected across mixture models, neural network training, quantile data structures, multi-scale vision systems, and depth estimation pipelines, and it frequently underpins architectural design choices, inference stability, and representational capacity in large models.
1. Theoretical Underpinnings of Scale-Depth Asymmetry
Scale-depth asymmetry arises when scale modulates a system’s behavior in a manner that cannot be inverted by manipulating only the depth. In monocular depth estimation, for instance, a predicted depth map $\hat{d}$ is ambiguous up to an unknown global scale $s$: both $\hat{d}$ and $s\hat{d}$ yield the same loss under scale-invariant supervision. This is a classic asymmetric dependency: only supplemental scale information resolves the ambiguity, and there is no reverse mechanism to infer $s$ from $\hat{d}$ alone (Guizilini et al., 2023, Wei et al., 2023).
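The ambiguity can be demonstrated directly with a scale-invariant log loss (the construction popularized by Eigen et al.). A minimal numpy sketch, not any particular paper's training code:

```python
import numpy as np

def silog_loss(pred, gt, lam=1.0):
    """Scale-invariant log loss in the style of Eigen et al.; with
    lam=1 the loss is exactly invariant to a global rescaling of
    `pred`, which is the source of the scale ambiguity."""
    g = np.log(pred) - np.log(gt)
    return np.mean(g ** 2) - lam * np.mean(g) ** 2

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=1000)        # ground-truth depths
pred = gt * rng.uniform(0.9, 1.1, size=1000)  # noisy prediction
base = silog_loss(pred, gt)
scaled = silog_loss(3.7 * pred, gt)           # apply an arbitrary global scale
# the supervision cannot distinguish pred from 3.7 * pred
```

Because the scaling enters only as an additive constant in log space, the variance-style loss cancels it exactly, so no amount of extra network depth can recover the missing global scale from this signal.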
In Bayesian linear neural networks, depth induces a data-dependent scale-mixture over Gaussian process predictors. As the number of layers grows, the posterior over “scales” collapses, making the inference increasingly insensitive to variations in the input scale, but depth itself cannot restore lost scale diversity once this collapse occurs (Zavatone-Veth et al., 2021).
In quantile estimation structures such as $t$-digests, the choice of scale function—especially when made asymmetric—determines accuracy allocation along the quantile axis, but changes to depth (i.e., cluster partitioning) do not invert the effect: only the scale function selection dictates which regions retain fine-grained accuracy (Ross, 2020).
2. Asymmetric Scale Functions and Quantile Estimation
The $t$-digest structure computes approximate quantiles using clusters with sizes constrained by a monotonic scale function $k(q)$. Classic scale functions ($k_0$ linear, $k_1$ arcsine) are symmetric about the median, resulting in equal accuracy in both tails (Ross, 2020).
The tangent-line (“glued”) construction introduces intentional asymmetry: on one side of a chosen split point $q_0$ the reference (symmetric) scale is kept, while on the other side it is replaced by the tangent line at $q_0$. This design concentrates accuracy in a chosen quantile region (upper or lower tail) at the cost of degrading it in the opposing region. Depth—the number of clusters or their allocation—cannot “undo” or symmetrically compensate for this; only the shape of the scale function and the split point control the trade-off.
Empirically, such an asymmetric scale reduces memory requirements by up to 60% (median centroid count drops from 2150 to 1130 for AVL-tree digests) with no loss in critical tail accuracy but clear coarsening elsewhere (Ross, 2020). The asymmetry is non-invertible: increasing depth cannot recover lost accuracy in the relaxed scale region, confirming the fundamental dependency of tail performance on scale function and not partition depth.
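A minimal sketch of the glued construction, assuming the classic arcsine scale function (often written $k_1$) and an illustrative split point; the compression parameter `DELTA` and the value of `q0` below are assumptions, not values from (Ross, 2020):

```python
import numpy as np

DELTA = 100.0  # compression parameter (illustrative value)

def k1(q):
    """Classic symmetric arcsine scale function for the t-digest."""
    return DELTA / (2 * np.pi) * np.arcsin(2 * q - 1)

def k1_prime(q):
    return DELTA / (np.pi * np.sqrt(1.0 - (2.0 * q - 1.0) ** 2))

def k_glued(q, q0=0.9):
    """Tangent-line ("glued") asymmetric scale function: above the
    split point q0 the curve is the full arcsine; below it, the
    tangent line at q0. The bounded slope below q0 coarsens
    lower-tail accuracy (cluster sizes scale like 1/k'(q)), while
    the upper tail keeps the diverging arcsine resolution.
    Illustrative sketch of the construction, not reference code."""
    q = np.asarray(q, dtype=float)
    tangent = k1(q0) + k1_prime(q0) * (q - q0)
    return np.where(q >= q0, k1(q), tangent)
```

Because cluster sizes in a $t$-digest are bounded by the local inverse slope of the scale function, flattening one side reallocates the accuracy budget rather than simply shrinking it; no re-partitioning of the clusters can restore the relaxed region's resolution.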
3. Multi-Scale and Selective Depth Attention in Neural Models
Selective Depth Attention (SDA) modules in deep feature extractors—e.g., SDA-Net—attend over the depth axis to combine features from blocks with different receptive fields, fusing information across a stage. The depth attention weights are learned through global average pooling on the summed block outputs, passed through a bottleneck and then a softmax, resulting in scale-conditioned, depth-wise attention (Guo et al., 2022).
Empirical analysis across input scales shows a clear monotonic asymmetry: as the spatial scale increases (inputs enlarged), deeper block features dominate the attention; as the scale decreases (inputs shrunk), shallow block features dominate. The shift is strictly one-way: input scale dictates how attention is allocated across depths, while no depth-side mechanism redirects the input scale.
The table below summarizes the observed directional effect:
| Scale Regime | Dominant Blocks | Attention Shift |
|---|---|---|
| Small scale (inputs shrunk) | Shallow (low block index) | Toward early, fine-grained features |
| Large scale (inputs enlarged) | Deep (high block index) | Toward late, large-receptive-field features |
This confirms the scale-depth asymmetric dependency: scale selection directs depth allocation, but not vice versa (Guo et al., 2022).
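The GAP-bottleneck-softmax pipeline described above can be sketched in a few lines of numpy; the shapes and bottleneck size are assumptions for illustration, not the exact SDA-Net module:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def depth_attention(blocks, W1, W2):
    """Minimal depth-wise attention sketch in the spirit of SDA
    (Guo et al., 2022). blocks: list of B feature maps, each of
    identical shape (C, H, W); W1 (hidden, C) and W2 (B, hidden)
    form the bottleneck. Returns the fused map and the per-depth
    attention weights."""
    stacked = np.stack(blocks)              # (B, C, H, W)
    summed = stacked.sum(axis=0)            # fuse blocks across the stage
    gap = summed.mean(axis=(1, 2))          # global average pool -> (C,)
    hidden = np.maximum(W1 @ gap, 0.0)      # bottleneck + ReLU
    logits = W2 @ hidden                    # one logit per block depth
    weights = softmax(logits)               # scale-conditioned depth weights
    fused = np.tensordot(weights, stacked, axes=1)  # weighted sum over depth
    return fused, weights
```

Note that the input (via its pooled statistics) determines the weights over depths, while nothing in the module flows the other way: the asymmetry is built into the dataflow.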
4. Asymmetric Dependencies in Monocular Depth Estimation
The ill-posed nature of monocular depth inference results not only from scale ambiguity (unknown scene size) but also from focal ambiguity (unknown camera intrinsics). FS-Depth injects the focal length as a feature at multiple resolutions, enabling the depth decoder to adjust its predictions according to scale and focal length (Wei et al., 2023). The model architecture routes this information only from scale/focal length to depth; there is no reciprocal path from depth estimates back to the input scale or focal length.
ZeroDepth further disambiguates metric scale via geometric pixel-ray embeddings and a variational latent code. This decoupling ensures that absolute scene geometry is grounded in observable scale signals, rather than being inferred solely from ambiguous depth cues (Guizilini et al., 2023).
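Pixel-ray embeddings rest on standard camera geometry: each pixel's viewing ray $K^{-1}[u, v, 1]^\top$ encodes the intrinsics, so the same image content seen through different cameras gets different embeddings. The sketch below is the generic construction, not ZeroDepth's exact encoding:

```python
import numpy as np

def pixel_rays(K, H, W):
    """Per-pixel unit viewing rays r = K^{-1} [u, v, 1]^T for an
    H x W image with intrinsics matrix K (3x3). The rays carry the
    focal length / principal point into the network as an observable
    geometric signal; the exact embedding used in ZeroDepth may differ."""
    u, v = np.meshgrid(np.arange(W), np.arange(H))       # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)     # (H, W, 3) homogeneous
    rays = pix @ np.linalg.inv(K).T                      # back-project each pixel
    return rays / np.linalg.norm(rays, axis=-1, keepdims=True)
```

Halving the focal length in `K` visibly spreads the ray field, which is exactly the scale cue a depth network cannot reconstruct from image content alone.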
In DesNet for depth completion, the absolute depth is factored as $D = g \cdot r$, with a unit-interval “relative” depth map $r$ and a global scale $g$. Optimization and representation improve by learning these factors separately (theoretically, this strictly enlarges the solution space and reduces training loss), but crucially, the system is asymmetric: $g$ scales all predictions but cannot be inferred purely from the predicted $r$ (Yan et al., 2022).
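The non-invertibility of the factorization is easy to demonstrate: two scenes with identical layout but different metric scales produce the same relative map. A toy sketch (symbol names are ours, not the paper's notation):

```python
import numpy as np

def factor(depth):
    """Split absolute depth into (global scale, relative map):
    DesNet-style factorization sketch with our own symbol names."""
    g = float(depth.max())   # one global scale per scene
    r = depth / g            # unit-interval "structure" map in (0, 1]
    return g, r

rng = np.random.default_rng(0)
layout = rng.uniform(0.5, 1.0, size=(4, 4))  # same scene layout...
g1, r1 = factor(2.0 * layout)                # ...at two metric scales
g2, r2 = factor(7.0 * layout)
# the relative maps are identical: g is unrecoverable from r alone
```

This is the asymmetry in miniature: the scale head determines every absolute prediction, while the relative branch carries no information about it.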
5. Bayesian and Statistical Models: Scale Mixtures and Depth
In overparameterized deep linear Bayesian NNs, the posterior predictive for a scalar output is a scale mixture of GPs, $p(f) = \int \mathcal{GP}(f \mid 0, \sigma^2 K)\, d\pi_L(\sigma^2)$, where $\pi_L$ is a depth-dependent mixing measure. As depth $L$ increases, $\pi_L$ concentrates near small $\sigma^2$, suppressing large-scale variance. Depth thus “averages out” scale diversity—a one-way, non-invertible mapping—as finite-depth models can flexibly interpolate over scales, but infinite depth removes this flexibility (Zavatone-Veth et al., 2021).
Design implications include selecting moderate depth for maximal scale-adaptivity in inference. Once depth is high enough, further increases do not recover flexibility, so the scale-depth relationship is unidirectionally suppressed by depth.
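The collapse can be caricatured with a toy scale mixture. The Gamma mixing family below is our stand-in for the depth-dependent measure (chosen only because its concentration is a single knob), not the paper's exact construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def predictive_samples(conc, n=100_000):
    """Toy scale-mixture predictive: sigma^2 ~ Gamma(conc, 1/conc)
    (mean 1), then f | sigma^2 ~ N(0, sigma^2). Large `conc` mimics
    the depth-induced collapse of the mixing measure onto a single
    scale; this Gamma family is an assumption for illustration."""
    sigma2 = rng.gamma(shape=conc, scale=1.0 / conc, size=n)
    return rng.standard_normal(n) * np.sqrt(sigma2)

def excess_kurtosis(x):
    return np.mean(x ** 4) / np.mean(x ** 2) ** 2 - 3.0

shallow = predictive_samples(conc=0.5)   # broad mixture of scales
deep = predictive_samples(conc=500.0)    # collapsed: effectively one scale
# both are unit-variance, but only the broad mixture is heavy-tailed;
# that excess kurtosis is the flexibility depth averages away
```

Once the mixing measure has collapsed, the predictive is indistinguishable from a single Gaussian: no further increase in `conc` (depth, in the analogy) can reintroduce the heavy tails.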
6. Scaling Laws and Asymmetry in Depth vs. Width
In deep self-attention networks optimized for in-context regression, risk scaling exhibits regime-dependent depth and width asymmetry. In the randomly rotated structured (RRS) regime, the risk decomposes into separate depth-limited and width-limited power-law terms, with exponents set by the feature-decay and label-source statistics of the task (Bordelon et al., 1 Oct 2025). Notably, depth governs performance through the source exponent alone (not a combination of exponents), while width scaling involves the feature-decay exponent as well, reflecting an asymmetric scaling effect. The relative “hardness” of scaling with respect to depth versus width flips at a critical value of these exponents, further emphasizing non-reciprocal dependencies as a function of data statistics and model structure.
Optimal depth–width tradeoffs require calibrating where increased depth actually yields diminishing returns versus where width scaling is more efficient—again reflecting unbalanced, data-dependent coupling.
7. Representation Dynamics and Scale-Depth Asymmetry in Vision Transformers
Analysis of Vision Transformer (ViT) scaling reveals non-monotonic interactions between model “scale” (ViT-S, ViT-B, ViT-L), depth, and task performance (Kumar, 26 Nov 2025). All observed models pass through a “Cliff–Plateau–Climb” trajectory in representation dynamics, but the critical “pivot” where information mixing optimally balances task and structure occurs at different depths for each architecture: ViT-S pivots at layer 9, ViT-B at 8, ViT-L only at 18. Increasing depth in the largest models drives the optimal transition later—and can degrade geometrically interpretable structures (e.g., simplex ETF in neural collapse metrics, ISI), rather than monotonically improving performance.
This reveals a scale–depth asymmetric dependency: model scale dictates the “phase structure” of depth evolution, while depth parameter alone cannot restore optimal representational alignment lost by over-scaling. Design implications include calibrating depth to align with structural phase transitions, rather than simply maximizing parameter count.
8. Broader Implications and Practical Applications
Scale-depth asymmetric dependencies unify disparate phenomena in quantile estimation, deep networks, Bayesian inference, attention architectures, and more. System design, optimization, and interpretability depend on understanding these non-invertible mappings. Key principles include:
- Explicit decoupling of scale and depth terms to enable targeted optimization (e.g., global scale heads, encoded camera intrinsics).
- Asymmetric accuracy allocation via scale-function selection in statistical data structures where depth cannot compensate.
- Depth-induced scale collapse in Bayesian inference, requiring moderation for maximal flexibility or compensatory mixture priors.
- Monitoring of phase transitions (e.g., ISI, NC metrics) in deep models to prevent over-scrambling and delayed optimization transitions.
A plausible implication is that for many architectures, optimal performance and resource usage require aligning depth with critical scales imposed by data or representational bottlenecks, rather than maximizing either parameter independently. Failure to respect the inherent asymmetry can waste capacity, collapse inference diversity, or degrade interpretability (Ross, 2020, Zavatone-Veth et al., 2021, Guo et al., 2022, Guizilini et al., 2023, Wei et al., 2023, Yan et al., 2022, Bordelon et al., 1 Oct 2025, Kumar, 26 Nov 2025).