Explain the quadratic dependence on sparsity in CNN sample complexity

Derive a theoretical explanation for why the sample complexity P*_CNN of Convolutional Neural Networks trained to learn the Sparse Random Hierarchy Model scales quadratically with the sparsity parameter (s0+1), despite each shared weight being connected to a fraction of the input that is independent of the hierarchy depth L. Establish principled conditions under which P*_CNN ∝ (s0+1)^2 n_c m^L arises from weight sharing and spatial sparsity of informative features in the SRHM.

Background

The paper empirically finds distinct sample complexity scalings for Locally Connected Networks (LCNs) and Convolutional Neural Networks (CNNs) on the Sparse Random Hierarchy Model (SRHM). For LCNs, P* scales as (s0+1)^L n_c m^L, while for CNNs it scales as (s0+1)^2 n_c m^L, suggesting a substantial benefit of weight sharing under sparsity.
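To make the gap between the two scaling laws concrete, the minimal Python sketch below evaluates both expressions and their ratio; the parameter values for s0, n_c, m, and L are hypothetical choices for illustration, not values taken from the paper's experiments.

```python
# Illustrative comparison of the two empirical scaling laws reported in the paper.
# Parameter values below are arbitrary and for illustration only.

def p_lcn(s0, n_c, m, L):
    # LCN sample complexity scaling: P* ~ (s0 + 1)^L * n_c * m^L
    return (s0 + 1) ** L * n_c * m ** L

def p_cnn(s0, n_c, m, L):
    # CNN sample complexity scaling: P* ~ (s0 + 1)^2 * n_c * m^L
    return (s0 + 1) ** 2 * n_c * m ** L

s0, n_c, m = 1, 10, 8  # hypothetical SRHM parameters
for L in (2, 3, 4):
    # The ratio reduces to (s0 + 1)^(L - 2): the advantage of weight sharing
    # grows exponentially with the hierarchy depth L.
    ratio = p_lcn(s0, n_c, m, L) / p_cnn(s0, n_c, m, L)
    print(f"L={L}: P*_LCN / P*_CNN = {ratio:.0f}")
```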

While the paper presents a heuristic argument justifying the exponential dependence on (s0+1) for LCNs, based on the reduced frequency of locally informative signals induced by sparsity, the corresponding quadratic scaling for CNNs remains unexplained within the paper. Clarifying the mechanism behind this quadratic dependence is necessary to complete the theoretical understanding of how weight sharing interacts with spatial sparsity in hierarchical generative tasks.

References

Qualitatively, the same scenario holds for CNNs. One expects a different sample complexity since each weight is now connected to a fraction of the input that is independent of $L$. Yet, the quadratic dependence on $s_0+1$ remains to be understood.

How Deep Networks Learn Sparse and Hierarchical Data: the Sparse Random Hierarchy Model (2404.10727 - Tomasini et al., 16 Apr 2024) in Section 6, Sample complexities arguments