Stacked Latent Clustering Network
- The Stacked Latent Clustering (SLC) network is a hierarchical self-supervised framework that iteratively predicts and clusters latent representations using modular predictor-clusterer units.
- The architecture coarsens representations through repeated patch extraction and local prediction-clustering modules, requiring exponentially fewer samples in depth than token-level methods.
- Empirical and theoretical analyses confirm that SLC maintains an O(v·m³) sample complexity independent of depth, promoting scalable and data-efficient learning in structured domains.
Stacked Latent Clustering (SLC) networks implement a hierarchy of self-supervised modules that learn latent representations by predicting their own clustered abstractions across multiple levels of compositional structure. The approach, validated both algorithmically and via end-to-end neural instantiations, demonstrates a sample complexity that is exponentially more efficient in model depth compared to token-level prediction, providing theoretical and empirical advantages for learning in structured domains such as the Random Hierarchy Model (RHM) (Korchinski et al., 26 May 2026).
1. Architectural Composition
An SLC network ingests a string of visible tokens drawn from a vocabulary of size v; the length of the string is fixed by the hierarchical depth and the branching factor of the generating grammar. The core of the architecture is a stack of identical “predictor + clusterer” modules. At each hierarchical level, a module processes a discrete token sequence whose symbols are indices into a cluster codebook of fixed size. The module workflow is:
- Patch Extraction: The tokens at the current level are grouped into non-overlapping tuples of adjacent tokens (patches), with the tuple size given by the branching factor.
- Prediction ("Pred"): Each patch is embedded into one-hot representations and processed by a small CNN+softmax, yielding a prediction tensor whose axes index local position, cousin patch, within-patch token, and output class.
- Clustering ("Clust"): The prediction tensor is vectorized and processed by another CNN+softmax, resulting in a soft label assignment, from which the hard discrete label is extracted via argmax or Gumbel-softmax.
The resulting hard labels become the input tokens for the next module, iteratively coarsening the representation.
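The two non-learned steps of this workflow, patch extraction and hard-label selection, can be illustrated with a minimal Python sketch. The function names, the toy token string, and the soft assignments are hypothetical; the learned Pred and Clust networks are not modeled, and only the argmax route (not Gumbel-softmax) is shown.

```python
from typing import List, Sequence, Tuple


def extract_patches(tokens: Sequence[int], patch_size: int) -> List[Tuple[int, ...]]:
    """Group a token sequence into non-overlapping patches of `patch_size` tokens."""
    if len(tokens) % patch_size != 0:
        raise ValueError("sequence length must be a multiple of patch_size")
    return [tuple(tokens[i:i + patch_size]) for i in range(0, len(tokens), patch_size)]


def hard_labels(soft_assignments: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable cluster index for each patch (argmax)."""
    return [max(range(len(p)), key=p.__getitem__) for p in soft_assignments]


# Toy input: 8 visible tokens, i.e. branching factor 2 and depth 3 (illustrative values).
tokens = [3, 1, 0, 2, 3, 1, 2, 2]
patches = extract_patches(tokens, patch_size=2)

# Made-up soft cluster assignments, one row per patch, standing in for Clust outputs.
soft = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]
next_tokens = hard_labels(soft)  # these labels would feed the next module
```

Each pass shortens the sequence by a factor equal to the patch size, so in this toy setting the 8 visible tokens become 4 level-1 tokens.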
2. Learning Objectives and Local Losses
Each SLC module is trained independently with two distinct local objectives:
- The prediction loss employs token masking within each cousin patch and optimizes a cross-entropy objective to reconstruct the masked entries. For every sample, patch, and cousin position, tokens are masked and the predictor attempts to infer the correct one-hot encoding; the objective is the cross-entropy between the predicted distribution and that one-hot target, taken over those indices.
- The clustering loss drives synonymous prediction vectors to collapse together and separates unrelated ones. It is built from the cosine similarity between vectors, a contrastive margin, and weighting hyperparameters for the separation and sparsity terms.
The sum of the prediction and clustering losses forms the total local loss. By default, gradients are not propagated between modules, so each module learns locally, in a manner reminiscent of biological computation.
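A minimal sketch of the two local loss ingredients follows, assuming generic forms: a cross-entropy evaluated only at masked positions, and a cosine-similarity margin term for pairs of prediction vectors. The function names and the exact shape of the margin term are illustrative assumptions rather than the published objective, and the sparsity term is omitted.

```python
import math
from typing import Sequence


def masked_cross_entropy(probs: Sequence[Sequence[float]],
                         targets: Sequence[int],
                         mask: Sequence[bool]) -> float:
    """Mean of -log p[target], computed over masked positions only."""
    losses = [-math.log(p[t]) for p, t, m in zip(probs, targets, mask) if m]
    if not losses:
        raise ValueError("at least one position must be masked")
    return sum(losses) / len(losses)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def pair_term(a: Sequence[float], b: Sequence[float],
              same_cluster: bool, margin: float) -> float:
    """Hypothetical margin term: pull same-cluster vectors together,
    push other pairs until their cosine similarity falls below `margin`."""
    c = cosine(a, b)
    return 1.0 - c if same_cluster else max(0.0, c - margin)


# Illustrative values: three positions, the first two masked.
probs = [[0.5, 0.5], [0.25, 0.75], [0.9, 0.1]]
pred_loss = masked_cross_entropy(probs, targets=[0, 1, 0], mask=[True, True, False])
pull = pair_term([1.0, 0.0], [1.0, 0.0], same_cluster=True, margin=0.5)
push = pair_term([1.0, 0.0], [1.0, 0.0], same_cluster=False, margin=0.5)
```

Because each term depends only on one module's inputs and outputs, it can be minimized without gradients from any other level.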
3. Iterative Latent Clustering Algorithm
The SLC framework is closely related to the Iterative Latent Clustering (ILC) algorithm, which hierarchically clusters prediction vectors at each level and passes the resulting cluster labels up as the tokens of the next level.
This iterative approach is non-parametric and matches the neural SLC in sample complexity for recovering latent hierarchies.
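The iterative procedure can be sketched on a toy example, under several stated assumptions: each patch is described by counts of the sibling patches observed beside it, and clustering uses a greedy cosine-similarity threshold as a simple stand-in for k-means. All names, the threshold, and the hand-built two-latent dataset are hypothetical and chosen so that synonymous patches have identical context statistics.

```python
import math
from collections import Counter, defaultdict


def to_patches(seq, s):
    """Split a sequence into non-overlapping patches of s tokens."""
    return [tuple(seq[i:i + s]) for i in range(0, len(seq), s)]


def context_vectors(sequences, s):
    """For every distinct patch, count the sibling patches seen beside it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        patches = to_patches(seq, s)
        for i, patch in enumerate(patches):
            start = (i // s) * s  # siblings share the same block of s patches
            for j in range(start, min(start + s, len(patches))):
                if j != i:
                    counts[patch][(j - i, patches[j])] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold):
    """Greedy threshold clustering (assumed stand-in for k-means)."""
    reps, labels = [], {}
    for patch in sorted(vectors):
        for idx, rep in enumerate(reps):
            if cosine(vectors[patch], rep) >= threshold:
                labels[patch] = idx
                break
        else:  # no existing cluster is close enough: open a new one
            labels[patch] = len(reps)
            reps.append(vectors[patch])
    return labels


def ilc(sequences, s, threshold=0.9):
    """Cluster patches, relabel, and repeat until one patch per sequence remains."""
    levels = []
    while len(sequences[0]) > s:
        labels = cluster(context_vectors(sequences, s), threshold)
        levels.append(labels)
        sequences = [[labels[p] for p in to_patches(seq, s)] for seq in sequences]
    return levels, sequences


# Toy grammar: latent A emits (0,1) or (2,3); latent B emits (0,2) or (1,3).
# Only the latent pairs (A, A) and (B, B) occur.
A, B = [(0, 1), (2, 3)], [(0, 2), (1, 3)]
data = [list(p + q) for group in (A, B) for p in group for q in group]
levels, top = ilc(data, s=2)
```

On this dataset the two productions of each latent receive the same label, and the relabeled sequences expose the latent pairs directly. With finite, noisy samples the context statistics only approximately coincide, which is where the sample-complexity analysis becomes relevant.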
4. Sample Complexity Analysis
For a balanced and separated RHM grammar, and assuming the stability of the clustering oracle, the number of samples required to recover all non-root latents is O(v·m³), where m is the number of grammar rules per parent and v is the vocabulary size. This is independent of depth. The reasoning is:
- At level 0, resolving token–token synonymy requires on the order of v·m³ samples for the empirical statistics to concentrate.
- Once a layer is decoded, it induces a new RHM of the same form, so the required sample count does not increase with depth.
- A union bound over levels confirms the result.
By contrast, masked-token SSL on the RHM requires a number of samples that grows exponentially with depth to recover the deepest latents, highlighting the exponential gain conferred by stacked latent prediction (Korchinski et al., 26 May 2026).
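As a rough illustration of the bound, the leading-order scale v·m³ can be tabulated for a few grammar sizes. The function name and the example values of v and m are hypothetical, and the constant factors hidden by the O(·) notation are ignored.

```python
def slc_sample_scale(v: int, m: int) -> int:
    """Leading-order sample scale v * m**3 (constants ignored)."""
    return v * m ** 3


# Illustrative (v, m) pairs; depth does not enter the expression at all.
scales = {(v, m): slc_sample_scale(v, m) for v, m in [(8, 2), (16, 4), (32, 8)]}
for (v, m), scale in scales.items():
    print(f"v={v:>2}, m={m}: v*m^3 = {scale}")
```

Doubling m multiplies the scale by eight, whereas adding levels to the hierarchy leaves it unchanged.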
5. Neural End-to-End Implementation
An end-to-end SLC network instantiates each predictor and clusterer as differentiable modules:
- Predictor ("Pred"): Three 1D convolutional layers (a strided first layer followed by stride-1 layers, with ReLU and batch norm), culminating in a softmax over the output classes for each patch position.
- Clusterer ("Clust"): Two 1D conv layers mapping the flattened prediction tensor to softmax outputs over the cluster codebook.
- Teacher–Student EMA: Both Pred and Clust maintain teacher copies whose weights are an exponential moving average of the student weights, enabling the student to predict the teacher’s outputs and avoid representational collapse.
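The teacher update can be sketched in its standard exponential-moving-average form. The decay value and the flat list-of-floats parameter representation are assumptions made for illustration; the coefficient used in the SLC experiments is not reproduced here.

```python
from typing import List


def ema_update(teacher: List[float], student: List[float], decay: float) -> List[float]:
    """Return teacher weights moved toward the student: decay * t + (1 - decay) * s."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical two-parameter model; in practice this runs after every student step.
teacher = [1.0, 0.0]
student = [0.0, 1.0]
teacher = ema_update(teacher, student, decay=0.75)
```

A decay close to 1 makes the teacher change slowly, giving the student a stable prediction target.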
Training uses AdamW with weight decay and a batch size of 32. The remaining hyperparameters are the learning rate, the number of hidden channels, the cluster codebook size, the contrastive margin, and the separation and sparsity weights.
6. Empirical Demonstration
Empirical studies were conducted on RHM instances with various branching factors, vocabulary sizes v, and rule counts m:
- ILC (k-means): Recovery-accuracy curves for all non-root latents, plotted against the number of samples, collapse when the sample count is rescaled by v·m³, with a sharp transition at a common rescaled value for all levels.
- End-to-End SLC: The SLC stack is pretrained with local losses. A linear probe on the final-layer representation accurately predicts the top-level latent. After rescaling the sample count, probe accuracy curves for the tested settings collapse to a common threshold, saturating the O(v·m³) bound.
- Depth Independence: Experiments across several depths reveal negligible shift in the threshold, confirming that the sample complexity remains O(v·m³) as depth increases.
- Comparison to Token-level SSL: Token-prediction objectives require exponentially many samples in depth to reconstruct deep latents and exhibit no scaling collapse under the same normalization, unlike the SLC and ILC curves, which align under the v·m³ rescaling.
Collectively, these findings establish that SLC and its iterative analog recover deep hierarchical latents with sharply reduced and depth-independent sample complexity.
7. Context and Implications
SLC exemplifies a hierarchical self-supervised learning paradigm centered on latent prediction rather than label or token prediction. The theoretical and empirical results indicate that explicit stacking of local predictor-clusterer modules is sufficient to saturate the O(v·m³) sample complexity bound, obviating the need for deeper or more global prediction strategies. This sample efficiency contrasts starkly with the exponential sample requirements of token-level prediction schemes, a distinction with implications for the study of data-efficient learning in both artificial and biological systems (Korchinski et al., 26 May 2026).
A plausible implication is that latent-prediction protocols underpinning recent models such as data2vec and JEPA inherently induce hierarchical latent clustering, and that further explicit stacking, as in H-JEPA, may offer limited additional benefit beyond what is achievable by local SLC-style modules. The SLC architecture offers a principled template for scalable, modular, and data-efficient inference in structured generative domains.