
CritiCore Module: Quantifying DNN Module Criticality

Updated 3 January 2026
  • CritiCore Module is a complexity measure for DNNs that quantifies the essentiality of individual network modules by analyzing the loss trajectory from initialization to convergence.
  • It computes module criticality by examining permissible parameter deviations and local flatness, utilizing grid search and noise perturbation to assess training accuracy impact.
  • The approach yields a formal PAC–Bayesian generalization bound, separates critical from non-critical modules, and ranks architectures by generalization gap better than traditional norm-based or spectral metrics.

The CritiCore Module, or module criticality, is a complexity measure for deep neural networks (DNNs) that quantifies the extent to which specific network modules are essential to network performance. It operates by probing the geometry of the loss landscape along the parameter trajectory between initialization and convergence for each module, incorporating both the permissible deviation from initialization (distance) and the local flatness (robustness to noise). CritiCore module criticality serves as both an explanatory and predictive tool, demarcating “critical” modules—whose rewinding to initialization substantially harms training accuracy—from “non-critical” modules, with a formal connection to generalization bounds via PAC–Bayes theory (Chatterji et al., 2019).

1. Mathematical Definition

Let a DNN be structured as a directed acyclic graph with $d$ modules, each indexed by $i$ and parameterized by $\theta_i \in \mathbb{R}^{k_i}$. Given random initialization $\theta_i^0$ and trained value $\theta_i^f$ for module $i$, define for any $\alpha_i \in [0,1]$ the interpolated weights:

$$\theta_i(\alpha_i) := (1 - \alpha_i)\theta_i^0 + \alpha_i\theta_i^f$$

and introduce a perturbation $u_i \sim \mathcal{N}(0, \sigma_i^2 I_{k_i})$. Construct $f_{\Theta^{\alpha} + U}$ as the network that uses the perturbed parameters for the $i$th module and keeps all others at their trained values. For an error tolerance $\epsilon > 0$, the module criticality is:

$$\mu_i(f_\Theta) = \min_{\substack{0 \leq \alpha_i \leq 1 \\ \sigma_i \geq 0}} \left\{ \frac{\alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2}{\sigma_i^2} : \mathbb{E}_{u_i}\left[L_S(f_{\Theta^\alpha + U})\right] \leq L_S(f_{\Theta^f}) + \epsilon \right\}$$

where $L_S$ is the empirical (0–1) loss on the training set. The network-wide criticality is the sum $\mu(f_\Theta) = \sum_{i=1}^d \mu_i(f_\Theta)$.

The criticality $\mu_i$ is small when the valley in the loss between $\theta_i^f$ and $\theta_i^0$ is both long (the constraint still holds at small $\alpha_i$, i.e., the module's weights can be moved far back toward initialization without losing accuracy) and flat (tolerant to noise, so a large $\sigma_i$ is allowed).
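The two ingredients of the definition, the interpolated weights and the ratio being minimized, can be sketched in a few lines. This is only an illustration under assumptions: the helper names `interpolate` and `criticality_ratio` and the two-dimensional weight vectors are hypothetical, and module weights are treated as flattened lists so that the Frobenius norm reduces to the Euclidean norm.

```python
import math

def interpolate(theta_init, theta_final, alpha):
    """Return (1 - alpha) * theta_init + alpha * theta_final, element-wise."""
    return [(1.0 - alpha) * w0 + alpha * wf for w0, wf in zip(theta_init, theta_final)]

def criticality_ratio(theta_init, theta_final, alpha, sigma):
    """Return alpha^2 * ||theta_final - theta_init||^2 / sigma^2 for flattened weights."""
    sq_dist = sum((wf - w0) ** 2 for w0, wf in zip(theta_init, theta_final))
    return alpha ** 2 * sq_dist / sigma ** 2

# Hypothetical flattened weights of one module (not taken from any real network).
theta_0 = [0.0, 0.0]   # initialization
theta_f = [3.0, 4.0]   # trained value, at distance 5 from theta_0

midpoint = interpolate(theta_0, theta_f, 0.5)

# Staying at the trained weights with little noise tolerance gives a large ratio;
# moving close to initialization with high noise tolerance gives a small one.
ratio_far = criticality_ratio(theta_0, theta_f, alpha=1.0, sigma=0.5)
ratio_near = criticality_ratio(theta_0, theta_f, alpha=0.2, sigma=2.0)
```

Whether a given pair $(\alpha_i, \sigma_i)$ is admissible still depends on the training-loss constraint, which this sketch does not evaluate.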

2. Practical Computation Procedure

Given a trained network and error tolerance $\epsilon$, CritiCore computation proceeds as follows for each module:

  • Select a discrete grid of $\alpha_i \in [0,1]$ (e.g., $\{0, 0.1, \ldots, 1.0\}$).
  • For each $\alpha_i$, construct $\theta_i(\alpha_i)$.
  • For this $\alpha_i$, maximize $\sigma_i$ such that when adding $u_i \sim \mathcal{N}(0, \sigma_i^2 I)$ to $\theta_i(\alpha_i)$:
    • The empirical loss on the training set remains at most $L_S(f_{\Theta^f}) + \epsilon$, estimated over $T$ noise samples (typically $T = 5$–$10$).
    • This is commonly solved by bisection search on $\sigma_i$.
  • For each candidate $(\alpha_i, \sigma_i)$, compute $R(\alpha_i,\sigma_i) = \alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2 / \sigma_i^2$. Set $\mu_i$ to the minimum achieved $R$ over all candidates.
  • Sum across modules to obtain $\mu(f_\Theta)$.
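The bisection step in the list above can be sketched as follows. This is a minimal illustration, not the reference implementation: `max_tolerable_sigma`, `toy_error`, `POINTS`, the upper bracket `sigma_hi`, and the fixed seed are all assumptions made for the example, and the toy "module" is a two-weight linear classifier on four hand-picked points. Because the mean error is a Monte Carlo estimate, the search is only approximately monotone in $\sigma_i$.

```python
import random

def max_tolerable_sigma(error_fn, theta, budget, sigma_hi=10.0, trials=8, steps=30, seed=0):
    """Bisection for the largest sigma whose mean noisy error stays within budget."""
    rng = random.Random(seed)

    def mean_noisy_error(sigma):
        # Average the error over `trials` draws of u ~ N(0, sigma^2 I).
        total = 0.0
        for _ in range(trials):
            noisy = [w + rng.gauss(0.0, sigma) for w in theta]
            total += error_fn(noisy)
        return total / trials

    lo, hi = 0.0, sigma_hi
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if mean_noisy_error(mid) <= budget:
            lo = mid   # constraint holds: try larger noise
        else:
            hi = mid   # constraint violated: shrink the noise level
    return lo

# Hypothetical training set: (features, label) pairs with labels in {+1, -1}.
POINTS = [((1.0, 0.2), 1), ((2.0, -0.5), 1), ((-1.0, 0.3), -1), ((-1.5, -0.4), -1)]

def toy_error(theta):
    """0-1 training error of the linear classifier sign(theta . x)."""
    wrong = sum(1 for (x1, x2), y in POINTS if y * (theta[0] * x1 + theta[1] * x2) <= 0)
    return wrong / len(POINTS)

theta_trained = [2.0, 0.0]                   # here alpha = 1, so theta(alpha) is the trained value
budget = toy_error(theta_trained) + 0.01     # L_S(f_trained) + epsilon
sigma_max = max_tolerable_sigma(toy_error, theta_trained, budget)
```

Repeating this call for every $\alpha_i$ on the grid and keeping the smallest ratio $R(\alpha_i, \sigma_i)$ gives the per-module value described in the list.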

Common discretization choices are coarse $\alpha$-grids and logarithmic $\sigma$ search. A small threshold $\epsilon$ (e.g., $0.01$) is recommended to ensure only slight degradation in training performance.
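As a small illustration of these discretization choices, the grids might be built like this. The helper names, the grid sizes, and the $\sigma$ range from $10^{-3}$ to $1$ are assumptions chosen for the example, not values prescribed by the source.

```python
import math

def linear_grid(start, stop, num):
    """Evenly spaced values from start to stop, both included."""
    step = (stop - start) / (num - 1)
    return [start + k * step for k in range(num)]

def log_grid(low, high, num):
    """Values spaced evenly in log10 between low and high, both included."""
    log_lo, log_hi = math.log10(low), math.log10(high)
    step = (log_hi - log_lo) / (num - 1)
    return [10.0 ** (log_lo + k * step) for k in range(num)]

alpha_grid = linear_grid(0.0, 1.0, 11)   # coarse grid: 0, 0.1, ..., 1.0
sigma_grid = log_grid(1e-3, 1.0, 7)      # candidate noise levels, two per decade
epsilon = 0.01                           # tolerance on the training error
```

A logarithmic grid of this kind can also serve to bracket $\sigma$ before a finer bisection.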

3. Formal Connection to Generalization

CritiCore module criticality admits a PAC–Bayesian generalization bound. For chosen $\alpha_i$ and variance $\sigma_i^2$, consider the posterior with each module's weights centered at $\theta_i(\alpha_i)$ and perturbed by Gaussian noise. The following bound holds (Theorem 3.1 (Chatterji et al., 2019)):

$$\mathbb{E}_U[L_D(f_{\Theta^{\alpha}+U})] \leq \mathbb{E}_U[L_S(f_{\Theta^{\alpha}+U})] + \sqrt{ \frac{ \frac{1}{4} \sum_i k_i \log\left[1 + \frac{\alpha_i^2\|\theta_i^f - \theta_i^0\|_F^2}{k_i \sigma_i^2}\right] + \log(m/\delta) + O(1) }{m-1} }$$

Here $m$ is the number of training samples, and the bound holds with probability at least $1 - \delta$ over the draw of the training set.

Optimizing over ($\alpha_i$, $\sigma_i$), subject to the empirical loss constraint, yields (Corollary 3.2) a bound on the true risk at the trained weights:

$$L_D(f_{\Theta^f}) \leq \epsilon + \sqrt{ \frac{ \frac{1}{4}\mu(f_\Theta) + \log(m/\delta) + O(1) }{m-1} }$$
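One step linking the logarithmic complexity term of the theorem to the criticality in the corollary can be written out. It uses only the elementary inequality $\log(1 + x) \leq x$ for $x \geq 0$, applied module by module; the rest of the argument is left to the cited paper.

```latex
k_i \log\left[1 + \frac{\alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2}{k_i \sigma_i^2}\right]
\;\leq\; \frac{\alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2}{\sigma_i^2}
\quad\Longrightarrow\quad
\frac{1}{4} \sum_{i=1}^{d} k_i \log\left[1 + \frac{\alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2}{k_i \sigma_i^2}\right]
\;\leq\; \frac{1}{4} \sum_{i=1}^{d} \frac{\alpha_i^2 \|\theta_i^f - \theta_i^0\|_F^2}{\sigma_i^2}
```

Evaluated at the pairs $(\alpha_i, \sigma_i)$ that attain each $\mu_i(f_\Theta)$, the right-hand side equals $\frac{1}{4}\mu(f_\Theta)$.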

Thus, smaller network criticality $\mu(f_\Theta)$ leads to a tighter upper bound on test error, aligning criticality directly with generalization ability.
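To see how the square-root term scales with criticality, it can be evaluated numerically. The sketch below is illustrative only: `complexity_term` is a hypothetical helper, the unspecified $O(1)$ term is exposed as a parameter `const` defaulting to zero, and the sample size, confidence level, and criticality values are made up.

```python
import math

def complexity_term(mu, m, delta, const=0.0):
    """sqrt((mu / 4 + log(m / delta) + const) / (m - 1)).

    `const` is a placeholder for the O(1) term, whose value is not given in the text.
    """
    return math.sqrt((0.25 * mu + math.log(m / delta) + const) / (m - 1))

m, delta = 50_000, 0.05   # hypothetical sample size and confidence level

low_mu_term = complexity_term(mu=100.0, m=m, delta=delta)
high_mu_term = complexity_term(mu=10_000.0, m=m, delta=delta)
```

With everything else fixed, the term grows monotonically in `mu`, which is the sense in which lower criticality gives a tighter bound.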

4. Key Empirical Findings

Empirical evaluation on ResNet-18 trained on CIFAR-10 highlights the following:

  • Rewinding experiments at module level: Certain modules (“Stage2.block1.conv2”) exhibit large increases in training error when reset to initialization—deemed “critical”—while most modules can be reset with negligible effect, hence “non-critical.”
  • Loss landscape geometry: Loss plotted along both the interpolation (trained to initialized) and noise directions displays that non-critical modules possess a wide, flat valley to initialization, while critical modules' valleys narrow near initialization, indicating greater sensitivity.
  • Failure of prior measures: Singular-value spectra, norms such as $\|\theta^f - \theta^0\|$, and CKA-based activation similarity do not distinguish critical from non-critical modules. Importantly, measures based purely on endpoint distances or single-point flatness miss these nuances.
  • Correlation with generalization: The architecture ranking by CritiCore ($\mu(f_\Theta)$) achieves higher Kendall $\tau$ correlation with the observed generalization gap compared to norm-based or conventional PAC–Bayesian measures.
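The rank correlation in the last finding can be computed as sketched below. This is the tie-free variant (often called tau-a) written out by hand; the function name and the four data points are hypothetical and are not results from the paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs; assumes no ties."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1   # both measures order this pair the same way
        elif sign < 0:
            discordant += 1   # the two measures disagree on this pair
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs

# Made-up values for four architectures, purely for illustration.
criticality = [12.0, 30.0, 18.0, 45.0]
gen_gap = [0.05, 0.11, 0.09, 0.10]

tau = kendall_tau(criticality, gen_gap)   # 5 concordant pairs, 1 discordant
```

A value near 1 means the complexity measure ranks architectures in nearly the same order as their generalization gaps.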

5. Comparison to Prior Complexity Measures

A summary contrasting CritiCore module criticality with previously proposed network complexity metrics:

| Measure | Limitation Relative to Criticality |
| --- | --- |
| Product of Frobenius norms | Only measures magnitude, not valley shape |
| Product of spectral norms | Ignores sensitivity/flatness |
| Distance to initialization | Endpoint-only, misses flatness |
| Sum of spectral-norm-distances | Linearized, fails to rank critical modules |
| PAC–Bayes at zero noise | Flatness at one point, not distance toward initialization |

CritiCore uniquely combines the extent of displacement from initialization (how far) with local flatness (how robust); it aggregates both the normed distance and the tolerable noise, yielding discrimination where earlier approaches do not.
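A small sketch makes the endpoint-only limitation concrete. Two of the baseline measures are computed for two hypothetical modules with flattened weights; the module names, weights, and helper functions are assumptions for illustration.

```python
import math

def frobenius_norm(weights):
    """Frobenius norm of a flattened weight list."""
    return math.sqrt(sum(w * w for w in weights))

def distance_to_init(theta_init, theta_final):
    """||theta_final - theta_init||_F, which depends only on the two endpoints."""
    return frobenius_norm([wf - w0 for w0, wf in zip(theta_init, theta_final)])

def product_of_frobenius_norms(modules):
    """Product over modules of the Frobenius norm of the trained weights."""
    return math.prod(frobenius_norm(theta) for theta in modules)

# Hypothetical flattened weights for two modules.
init = {"module_a": [0.0, 0.0], "module_b": [1.0, 1.0]}
final = {"module_a": [3.0, 4.0], "module_b": [4.0, 5.0]}

distances = {name: distance_to_init(init[name], final[name]) for name in init}
norm_product = product_of_frobenius_norms(final.values())
```

Both modules receive the same distance to initialization here, so that measure cannot separate them. Any difference in criticality would have to come from how the training loss behaves along the path and under noise, which neither baseline evaluates.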

6. Algorithmic Implementation

A high-level pseudocode sketch of empirical module criticality computation, following (Chatterji et al., 2019):

μ = 0
for i in 1..d:
    best_R = +∞
    Δ = norm_F(θᵢᶠ − θᵢ⁰)
    for α in α_grid:  # e.g. α_grid = [0, 0.1, …, 1]
        θ̄ᵢ = (1 − α)*θᵢ⁰ + α*θᵢᶠ
        # Find the largest σ for which noise keeps train error <= orig_error + ε
        σ_max = binary_search_on_σ(low=0, high=σ_upper), testing each candidate σ by:
            for trial in 1..T:  # T ~ 5–10
                u ~ N(0, σ² I)
                Θ_test = Θᶠ; Θ_test[i] = θ̄ᵢ + u
                errs[trial] = train_error(f_{Θ_test}, S)
            if mean(errs) <= orig_error + ε: accept σ (search higher) else reject σ (search lower)
        R = (α² * Δ²) / (σ_max² + tiny)  # tiny guards against division by zero
        best_R = min(best_R, R)
    μᵢ = best_R
    μ += μᵢ
return {μᵢ}, μ

Typical choices include $\epsilon \sim 0.01$, coarse $\alpha$-grids, logarithmic $\sigma$ search, and $T = 5$–$10$ samples for estimation. Refinement can target promising $\alpha$ regions.
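For readers who want something executable, the procedure can be sketched in plain Python on a toy model. Everything here is an assumption made for illustration: the function `module_criticality`, the two-module linear classifier (a weight vector and a bias), the four data points, the bracket `sigma_hi`, and the choice to skip infeasible $\alpha$ values instead of adding a small constant to the denominator. It is not the authors' implementation.

```python
import math
import random

def module_criticality(init, final, train_error, eps=0.01, alpha_grid=None,
                       sigma_hi=10.0, trials=8, steps=25, seed=0):
    """Return ([mu_i], mu) for modules given as lists of flattened weights."""
    rng = random.Random(seed)
    if alpha_grid is None:
        alpha_grid = [k / 10 for k in range(11)]
    budget = train_error(final) + eps          # L_S(f_trained) + epsilon
    mus = []
    for i in range(len(final)):
        sq_dist = sum((wf - w0) ** 2 for w0, wf in zip(init[i], final[i]))
        best = math.inf
        for alpha in alpha_grid:
            center = [(1 - alpha) * w0 + alpha * wf for w0, wf in zip(init[i], final[i])]

            def mean_noisy_error(sigma):
                total = 0.0
                for _ in range(trials):
                    test = list(final)         # other modules stay at trained values
                    test[i] = [w + rng.gauss(0.0, sigma) for w in center]
                    total += train_error(test)
                return total / trials

            if mean_noisy_error(0.0) > budget:
                continue                       # infeasible even without noise
            lo, hi = 0.0, sigma_hi
            for _ in range(steps):             # bisection on sigma
                mid = 0.5 * (lo + hi)
                if mean_noisy_error(mid) <= budget:
                    lo = mid
                else:
                    hi = mid
            if lo > 0.0:
                best = min(best, alpha ** 2 * sq_dist / lo ** 2)
        mus.append(best)
    return mus, sum(mus)

# Hypothetical training set and two-module model: module 0 = weights, module 1 = bias.
DATA = [((1.0, 0.2), 1), ((2.0, -0.5), 1), ((-1.0, 0.3), -1), ((-1.5, -0.4), -1)]

def toy_train_error(modules):
    """0-1 training error of sign(w . x + b)."""
    (w1, w2), (b,) = modules
    wrong = sum(1 for (x1, x2), y in DATA if y * (w1 * x1 + w2 * x2 + b) <= 0)
    return wrong / len(DATA)

init_weights = [[-1.0, 0.5], [0.1]]
final_weights = [[2.0, 0.0], [0.0]]

mus, total = module_criticality(init_weights, final_weights, toy_train_error)
```

In this toy setup the bias can be rewound to its initial value without any training error, so its criticality is zero, while rewinding the weight vector misclassifies every point and yields a strictly positive value.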

7. Significance and Interpretive Summary

CritiCore module criticality provides a rigorous, operationally meaningful measure that probes the loss-valley structure between initialization and final weights, considering both permissible parameter drift without significant accuracy loss and the flatness of the loss around the traversed path. As it connects directly to PAC–Bayesian generalization bounds and delineates which modules fundamentally influence generalization, CritiCore unifies explanatory and predictive roles, outperforming linear or norm-based measures that ignore valley geometry or inter-module heterogeneity (Chatterji et al., 2019). This framework thus addresses both “why these modules matter” and “which architectures generalize better,” with empirical validation and efficient computational algorithms.
