CritiCore Module: Quantifying DNN Module Criticality
- CritiCore Module is a complexity measure for DNNs that quantifies the essentiality of individual network modules by analyzing the loss trajectory from initialization to convergence.
- It computes module criticality by examining permissible parameter deviations and local flatness, utilizing grid search and noise perturbation to assess training accuracy impact.
- The approach establishes formal PAC–Bayesian generalization bounds and distinguishes critical from non-critical modules where traditional norm-based or spectral metrics do not.
The CritiCore Module, or module criticality, is a complexity measure for deep neural networks (DNNs) that quantifies the extent to which specific network modules are essential to network performance. It operates by probing the geometry of the loss landscape along the parameter trajectory between initialization and convergence for each module, incorporating both the permissible deviation from initialization (distance) and the local flatness (robustness to noise). CritiCore module criticality serves as both an explanatory and predictive tool, demarcating “critical” modules—whose rewinding to initialization substantially harms training accuracy—from “non-critical” modules, with a formal connection to generalization bounds via PAC–Bayes theory (Chatterji et al., 2019).
1. Mathematical Definition
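Let a DNN be structured as a directed acyclic graph with $d$ modules, each indexed by $i$ and parameterized by $\theta_i$. Given random initialization $\theta_i^0$ and trained value $\theta_i^F$ for module $i$, define for any $\alpha \in [0, 1]$ the interpolated weights:

$$\theta_i^\alpha = (1-\alpha)\,\theta_i^0 + \alpha\,\theta_i^F$$

and introduce a perturbation $u \sim \mathcal{N}(0, \sigma^2 I)$. Construct $f_{\theta_i^\alpha + u,\,\Theta_{-i}^F}$ as the network that uses the perturbed parameters $\theta_i^\alpha + u$ for the $i$th module and keeps all others at their trained values. For an error tolerance $\epsilon > 0$, the module criticality is:

$$\mu_{i,\epsilon}(f_\Theta) = \min_{0 \le \alpha \le 1,\; 0 < \sigma \le 1} \left\{ \frac{\alpha^2\,\|\theta_i^F - \theta_i^0\|_F^2}{\sigma^2} \;:\; \mathbb{E}_{u \sim \mathcal{N}(0, \sigma^2 I)}\left[ L_S\left(f_{\theta_i^\alpha + u,\,\Theta_{-i}^F}\right) \right] \le \epsilon \right\}$$

where $L_S$ is the empirical (0–1) loss on the training set $S$. The network-wide criticality is the sum $\mu_\epsilon(f_\Theta) = \sum_{i=1}^{d} \mu_{i,\epsilon}(f_\Theta)$.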
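The criticality ($\mu_{i,\epsilon}$) is small when the valley in the loss between $\theta_i^0$ and $\theta_i^F$ is both long (the module can be moved far back toward initialization, i.e. a small $\alpha$ is achievable without losing accuracy) and flat (tolerant to noise, so a large $\sigma$ is allowed).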
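The two ingredients of the definition, interpolation toward initialization and the distance-to-noise ratio, can be illustrated with a minimal sketch. The function names and the toy weight vectors below are hypothetical and chosen only for illustration; a module's weights are treated as a flat list of floats.

```python
import math

def interpolate(theta_0, theta_f, alpha):
    """Return (1 - alpha) * theta_0 + alpha * theta_f, element-wise."""
    return [(1.0 - alpha) * w0 + alpha * wf for w0, wf in zip(theta_0, theta_f)]

def frobenius_distance(theta_0, theta_f):
    """Euclidean (Frobenius) distance between two flattened weight lists."""
    return math.sqrt(sum((wf - w0) ** 2 for w0, wf in zip(theta_0, theta_f)))

def criticality_ratio(theta_0, theta_f, alpha, sigma):
    """alpha^2 * ||theta_f - theta_0||^2 / sigma^2 for one (alpha, sigma) candidate."""
    dist = frobenius_distance(theta_0, theta_f)
    return (alpha ** 2) * dist ** 2 / sigma ** 2

theta_0 = [0.0, 0.0]  # hypothetical initial weights
theta_f = [3.0, 4.0]  # hypothetical trained weights (distance 5 from theta_0)

print(interpolate(theta_0, theta_f, 0.5))             # [1.5, 2.0]
print(criticality_ratio(theta_0, theta_f, 0.5, 1.0))  # 6.25
print(criticality_ratio(theta_0, theta_f, 0.5, 0.5))  # 25.0
```

Halving the tolerable noise level quadruples the ratio, while moving halfway back to initialization reduces it by a factor of four relative to staying at the trained weights. The ratio only counts as a candidate for the minimum when the loss constraint holds for that pair.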
2. Practical Computation Procedure
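Given a trained network $f_\Theta$ and error tolerance $\epsilon$, CritiCore computation proceeds as follows for each module $i$:
- Select a discrete grid of $\alpha$ values (e.g., $\alpha \in \{0, 0.1, \ldots, 1\}$).
- For each $\alpha$, construct $\theta_i^\alpha = (1-\alpha)\,\theta_i^0 + \alpha\,\theta_i^F$.
- For this $\alpha$, maximize $\sigma$ such that when adding $u \sim \mathcal{N}(0, \sigma^2 I)$ to $\theta_i^\alpha$:
  - The empirical loss on the training set remains at most $\epsilon$, estimated over several noise samples ($T$ typically 5–10).
  - This is commonly solved by bisection search on $\sigma$.
- For each candidate $(\alpha, \sigma)$, compute $\alpha^2\,\|\theta_i^F - \theta_i^0\|_F^2 / \sigma^2$. Set $\mu_{i,\epsilon}$ to the minimum achieved over all candidates.
- Sum across modules to obtain $\mu_\epsilon$.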
Common discretization choices are coarse $\alpha$-grids and logarithmic search over $\sigma$. A small threshold $\epsilon$ (e.g., $0.01$) is recommended to ensure only slight degradation in training performance.
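As an alternative to bisection, a logarithmic scan over the noise scale can be sketched as follows. This is a minimal illustration under stated assumptions: `log_sigma_grid`, `largest_feasible_sigma`, and `toy_mean_error` are hypothetical names, and `toy_mean_error` is an analytic stand-in for the Monte Carlo estimate of the training error under noise, not a measurement from any real network.

```python
def log_sigma_grid(sigma_min=1e-3, sigma_max=1.0, num=7):
    """Log-spaced noise scales from sigma_min to sigma_max, inclusive."""
    ratio = (sigma_max / sigma_min) ** (1.0 / (num - 1))
    return [sigma_min * ratio ** k for k in range(num)]

def largest_feasible_sigma(mean_error, eps, grid):
    """Largest sigma in grid whose estimated mean error is <= eps, else None."""
    feasible = [s for s in grid if mean_error(s) <= eps]
    return max(feasible) if feasible else None

def toy_mean_error(sigma):
    """Hypothetical stand-in: error grows quadratically with the noise scale."""
    return min(1.0, 0.5 * sigma ** 2)

grid = log_sigma_grid()  # roughly 0.001, 0.00316, 0.01, ..., 1.0
sigma_star = largest_feasible_sigma(toy_mean_error, eps=0.01, grid=grid)
print(sigma_star)        # approximately 0.1 on this toy error curve
```

A log-spaced grid evaluates every candidate independently, so it is less sensitive than bisection to a noisy error estimate that is not perfectly monotone in the noise scale, at the cost of coarser resolution.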
3. Formal Connection to Generalization
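CritiCore module criticality admits a PAC–Bayesian generalization bound. For chosen $\alpha_i \in [0, 1]$ and variance $\sigma_i^2$ (with $0 < \sigma_i \le 1$), consider the posterior with each module's weights centered at $\theta_i^{\alpha_i} = (1-\alpha_i)\,\theta_i^0 + \alpha_i\,\theta_i^F$ and perturbed by Gaussian noise $u_i \sim \mathcal{N}(0, \sigma_i^2 I)$. The following bound holds (Theorem 3.1 (Chatterji et al., 2019)): with probability at least $1-\delta$ over a training set $S$ of $m$ samples,

$$\mathbb{E}_u\left[L_D(f_{\Theta^\alpha + u})\right] \le \mathbb{E}_u\left[L_S(f_{\Theta^\alpha + u})\right] + \sqrt{\frac{\frac{1}{4}\sum_{i=1}^{d} k_i \log\left(1 + \frac{\alpha_i^2\,\|\theta_i^F - \theta_i^0\|_F^2}{k_i\,\sigma_i^2}\right) + \log\frac{m}{\delta} + \tilde{O}(1)}{m-1}}$$

where $L_D$ is the true risk, $L_S$ the empirical risk, $\Theta^\alpha$ collects the interpolated weights of all modules, and $k_i$ is the number of parameters in module $i$.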
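Optimizing over ($\alpha_i$, $\sigma_i$), subject to the empirical loss constraint, yields (Corollary 3.2) a bound on the true risk at the trained weights in which the complexity term is controlled by the network criticality $\mu_\epsilon(f_\Theta)$.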
Thus, smaller network criticality leads to a tighter upper bound on test error, aligning criticality directly with generalization ability.
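A short worked equation indicates why the ratio of squared distance to noise variance appears in a PAC–Bayesian complexity term. Write $\theta_i^0$ and $\theta_i^F$ for the initial and trained weights of module $i$, and $\theta_i^{\alpha_i} = (1-\alpha_i)\,\theta_i^0 + \alpha_i\,\theta_i^F$ for the interpolated weights. As a simplifying assumption made only for this illustration, take a Gaussian prior centered at initialization with the same variance $\sigma_i^2$ as the posterior (the cited theorem may choose the prior variance differently). The KL divergence between two Gaussians with equal isotropic covariance is then

$$\mathrm{KL}\left(\mathcal{N}(\theta_i^{\alpha_i}, \sigma_i^2 I)\,\middle\|\,\mathcal{N}(\theta_i^0, \sigma_i^2 I)\right) = \frac{\|\theta_i^{\alpha_i} - \theta_i^0\|_F^2}{2\sigma_i^2} = \frac{\alpha_i^2\,\|\theta_i^F - \theta_i^0\|_F^2}{2\sigma_i^2},$$

where the last step uses $\theta_i^{\alpha_i} - \theta_i^0 = \alpha_i\,(\theta_i^F - \theta_i^0)$. Under this assumption the per-module KL term is half the quantity minimized in the definition of module criticality.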
4. Key Empirical Findings
Empirical evaluation on ResNet-18 trained on CIFAR-10 highlights the following:
- Rewinding experiments at module level: Certain modules (e.g., “Stage2.block1.conv2”) exhibit large increases in training error when reset to initialization and are deemed “critical,” while most modules can be reset with negligible effect and are hence “non-critical.”
- Loss landscape geometry: Plotting the loss along both the interpolation direction (trained to initialized) and the noise direction shows that non-critical modules possess a wide, flat valley extending to initialization, while critical modules' valleys narrow near initialization, indicating greater sensitivity.
- Failure of prior measures: Singular-value spectra, norm-based measures (such as Frobenius norms or distance to initialization), and CKA-based activation similarity do not distinguish critical from non-critical modules. Importantly, measures based purely on endpoint distances or single-point flatness miss these nuances.
- Correlation with generalization: Ranking architectures by CritiCore ($\mu_\epsilon$) achieves higher Kendall rank correlation with the observed generalization gap than norm-based or conventional PAC–Bayesian measures.
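For readers unfamiliar with the Kendall rank correlation used in the last finding, here is a minimal sketch of how such a score is computed. The function `kendall_tau` is a hypothetical helper implementing the tau-a variant without tie handling, and the numbers are made up purely for illustration; they are not results from the paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall tau-a: (concordant - discordant) / number of pairs (no tie handling)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

# Made-up values for four hypothetical architectures; NOT data from the paper.
criticality = [1.0, 2.0, 3.0, 4.0]
gen_gap = [0.10, 0.12, 0.11, 0.20]

# 5 concordant pairs and 1 discordant pair out of 6, so tau = 4/6.
print(kendall_tau(criticality, gen_gap))
```

A value near 1 means the complexity measure orders the architectures almost exactly as the generalization gap does; a value near 0 means the ordering is uninformative.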
5. Comparison to Prior Complexity Measures
A summary contrasting CritiCore module criticality with previously proposed network complexity metrics:
| Measure | Limitation Relative to Criticality |
|---|---|
| Product of Frobenius norms | Only measures magnitude, not valley shape |
| Product of spectral norms | Ignores sensitivity/flatness |
| Distance to initialization | Endpoint-only, misses flatness |
| Sum of spectral-norm-distances | Linearized, fails to rank critical modules |
| PAC–Bayes at zero noise | Flatness at one point, not distance toward initialization |
CritiCore uniquely combines the extent of displacement from initialization (how far) with local flatness (how robust); it aggregates both the normed distance and the tolerable noise, yielding discrimination where earlier approaches do not.
6. Algorithmic Implementation
A high-level pseudocode sketch of the empirical module criticality computation, following (Chatterji et al., 2019), is:

```
μ = 0
for i in 1..d:
    best_R = +∞
    Δ = norm_F(θᵢᶠ – θᵢ⁰)
    for α in α_grid:                        # e.g. α_grid = [0, 0.1, …, 1]
        θ̄ᵢ = (1–α)*θᵢ⁰ + α*θᵢᶠ
        # Find max σ where noise leaves train-error <= orig + ε
        σ_max = binary_search_on_σ(low=0, high=σ_upper):
            for trial in 1..T:              # T ~ 5–10
                u ~ N(0, σ^2 I)
                Θ_test = Θᶠ; Θ_test[i] = θ̄ᵢ + u
                errs[trial] = train_error(f_{Θ_test}, S)
            if mean(errs) <= orig_error + ε: OK σ else too_large
        R = (α^2 * Δ^2) / (σ_max^2 + tiny)
        best_R = min(best_R, R)
    μᵢ = best_R
    μ += μᵢ
return {μᵢ}, μ
```

Here the tolerance is applied relative to the unperturbed training error (`orig_error + ε`), which coincides with the absolute threshold $\epsilon$ of Section 2 when the trained network fits the training set exactly.
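The pseudocode can be made concrete on a toy problem. The following is a minimal runnable sketch, not the authors' implementation. Everything in it is an assumption made for illustration: the two-module model `sign(w1 * w2 * x)`, the four-point dataset, the helper names, the absolute threshold `eps`, and the extra noise-free feasibility check that skips any $\alpha$ whose interpolated weights already violate the tolerance.

```python
import math
import random

def train_error(modules, data):
    """0-1 training error of the toy model sign(w1 * w2 * x)."""
    w1, w2 = modules[0][0], modules[1][0]
    wrong = sum(1 for x, y in data if (1 if w1 * w2 * x > 0 else -1) != y)
    return wrong / len(data)

def noisy_error(trained, i, theta_bar, sigma, data, trials, rng):
    """Mean training error with module i set to theta_bar + N(0, sigma^2 I)."""
    total = 0.0
    for _ in range(trials):
        test = [list(m) for m in trained]  # other modules stay at trained values
        test[i] = [w + rng.gauss(0.0, sigma) for w in theta_bar]
        total += train_error(test, data)
    return total / trials

def module_criticality(init, trained, i, data, eps=0.01, alpha_grid=None,
                       sigma_upper=1.0, trials=10, steps=20, seed=0):
    """Grid over alpha, bisection over sigma, minimum of alpha^2 * dist^2 / sigma^2."""
    rng = random.Random(seed)
    if alpha_grid is None:
        alpha_grid = [k / 10 for k in range(11)]
    delta_sq = sum((wf - w0) ** 2 for w0, wf in zip(init[i], trained[i]))
    best = math.inf
    for alpha in alpha_grid:
        theta_bar = [(1 - alpha) * w0 + alpha * wf
                     for w0, wf in zip(init[i], trained[i])]
        # Assumption of this sketch: skip alphas that fail even without noise.
        clean = [list(m) for m in trained]
        clean[i] = theta_bar
        if train_error(clean, data) > eps:
            continue
        lo, hi = 0.0, sigma_upper
        if noisy_error(trained, i, theta_bar, hi, data, trials, rng) <= eps:
            lo = hi  # the largest allowed noise scale is already tolerated
        else:
            for _ in range(steps):  # bisection on sigma
                mid = 0.5 * (lo + hi)
                if noisy_error(trained, i, theta_bar, mid, data, trials, rng) <= eps:
                    lo = mid
                else:
                    hi = mid
        best = min(best, alpha ** 2 * delta_sq / (lo ** 2 + 1e-12))
    return best

# Hypothetical toy setup: labels are sign(x); module 1 starts with the wrong sign.
data = [(1.0, 1), (2.0, 1), (-1.0, -1), (-3.0, -1)]
init = [[1.0], [-1.0]]
trained = [[2.0], [2.0]]
mu = [module_criticality(init, trained, i, data) for i in range(2)]
print(mu, sum(mu))
```

In this toy setup module 0 can be rewound fully to its initial value without changing any prediction, so its criticality is zero, while module 1 must stay on the trained side of the sign flip and therefore receives a strictly positive value. Because the error under noise is a Monte Carlo estimate, the exact value for module 1 depends on the seed and on the number of trials.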
7. Significance and Interpretive Summary
CritiCore module criticality provides a rigorous, operationally meaningful measure that probes the loss-valley structure between initialization and final weights, considering both permissible parameter drift without significant accuracy loss and the flatness of the loss around the traversed path. As it connects directly to PAC–Bayesian generalization bounds and delineates which modules fundamentally influence generalization, CritiCore unifies explanatory and predictive roles, outperforming linear or norm-based measures that ignore valley geometry or inter-module heterogeneity (Chatterji et al., 2019). This framework thus addresses both “why these modules matter” and “which architectures generalize better,” with empirical validation and efficient computational algorithms.