Stacked Latent Clustering Network

Updated 31 May 2026

The Stacked Latent Clustering (SLC) network is a hierarchical self-supervised framework that iteratively predicts and clusters latent representations using modular predictor-clusterer units.
The architecture coarsens representations through repeated patch extraction and local prediction-clustering modules, requiring exponentially fewer samples in depth than token-level methods.
Empirical and theoretical analyses indicate that SLC maintains an $O(v m^3)$ sample complexity independent of depth, promoting scalable and data-efficient learning in structured domains.

Stacked Latent Clustering (SLC) networks implement a hierarchy of self-supervised modules that learn latent representations by predicting their own clustered abstractions across multiple levels of compositional structure. The approach, validated both algorithmically and via end-to-end neural instantiations, demonstrates a sample complexity that is exponentially more efficient in model depth compared to token-level prediction, providing theoretical and empirical advantages for learning in structured domains such as the Random Hierarchy Model (RHM) (Korchinski et al., 26 May 2026).

1. Architectural Composition

An SLC network ingests a string of visible tokens $x = (x_1, \ldots, x_{s^L}) \in \mathcal V_0^{s^L}$, where $L$ is the hierarchical depth, $s$ is the branching factor, and $\mathcal V_0$ is a vocabulary of size $v$. The core of the architecture is a stack of $(L-1)$ identical “predictor + clusterer” modules. At hierarchical level $\ell$, each module processes a discrete token sequence

$\widehat h^{(\ell)} = (\widehat h^{(\ell)}_1, \ldots, \widehat h^{(\ell)}_{s^{L-\ell}}) \in \{1, \ldots, d_h\}^{s^{L-\ell}},$

where $d_h$ is the cluster codebook size. The module workflow is:

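Patch Extraction: The level-$\ell$ tokens are grouped into non-overlapping $s$-tuples (patches),

$p^{(\ell)}_j = \big(\widehat h^{(\ell)}_{(j-1)s+1}, \ldots, \widehat h^{(\ell)}_{js}\big), \qquad j = 1, \ldots, s^{L-\ell-1}.$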

Prediction ("Pred"): Each patch is embedded into one-hot representations and processed by a small CNN+softmax, yielding a prediction tensor whose axes index local position, cousin patch, within-patch token, and output class.

Clustering ("Clust"): The prediction tensor is vectorized and processed by another CNN+softmax, resulting in a soft label assignment over the $d_h$ cluster labels, from which the hard discrete label is extracted via $\arg\max$ or Gumbel-softmax.

The outputs $\widehat h^{(\ell+1)}$ become the input tokens for the next module, iteratively coarsening representations.
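The patch-extraction and hard-label steps above can be illustrated with a minimal sketch in plain Python. The function names (`extract_patches`, `level_lengths`, `hard_label`) are hypothetical and not taken from the source, and labels are 0-based here, whereas the text indexes them from 1 to $d_h$.

```python
def extract_patches(tokens, s):
    """Group a token sequence into non-overlapping s-tuples (patches)."""
    if len(tokens) % s != 0:
        raise ValueError("sequence length must be a multiple of s")
    return [tuple(tokens[i:i + s]) for i in range(0, len(tokens), s)]


def level_lengths(s, L):
    """Sequence length s**(L - l) at each level l = 0, ..., L."""
    return [s ** (L - l) for l in range(L + 1)]


def hard_label(soft):
    """Arg-max of a soft label assignment (0-based index; an illustration only)."""
    return max(range(len(soft)), key=soft.__getitem__)


tokens = [3, 1, 4, 1, 5, 9, 2, 6]      # s = 2, L = 3: 2**3 = 8 visible tokens
print(extract_patches(tokens, s=2))    # [(3, 1), (4, 1), (5, 9), (2, 6)]
print(level_lengths(2, 3))             # [8, 4, 2, 1]
print(hard_label([0.1, 0.7, 0.2]))     # 1
```

Each module shortens the sequence by a factor of $s$, which is why the sequence at level $\ell$ has length $s^{L-\ell}$.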

2. Learning Objectives and Local Losses

Each SLC module is trained independently with two distinct local objectives:

The prediction loss employs token masking within each cousin patch, optimizing cross-entropy to reconstruct masked entries. For every sample, patch, and cousin position, tokens are masked and the predictor attempts to infer the correct one-hot encoding; the objective is the cross-entropy between the predicted distribution and the one-hot target, accumulated over these indices.

The clustering loss drives the collapse of synonymous vectors and the separation of unrelated ones. It is a contrastive objective built on the cosine similarity between vectors, with a margin and with hyperparameters weighting the separation and sparsity terms.

The sum of the prediction and clustering losses forms the total local loss. By default, gradients are not propagated between modules, enabling module-wise locality reminiscent of biological computation.
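As a rough illustration of the two ingredients named above, the sketch below shows a cross-entropy against a one-hot target and a generic margin-based contrastive term on cosine similarity. It is not the source's exact objective: the function names, the `same` flag, the default `margin` and `lam_sep` values, and the omission of the sparsity term are all assumptions made for the example.

```python
import math


def cross_entropy(probs, target):
    """Cross-entropy of a predicted distribution against a one-hot target index."""
    return -math.log(probs[target])


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_contrastive(u, v, same, margin=0.5, lam_sep=1.0):
    """Generic contrastive term (assumed form, not the source's loss).

    Pairs flagged `same` are pulled together (penalty 1 - cos); other pairs are
    penalised only when their similarity exceeds the margin.
    """
    c = cosine(u, v)
    if same:
        return 1.0 - c
    return lam_sep * max(0.0, c - margin)


print(cross_entropy([0.25, 0.75], target=1))           # -log(0.75), about 0.288
print(margin_contrastive([1, 0], [1, 0], same=True))   # 0.0
print(margin_contrastive([1, 0], [0, 1], same=False))  # 0.0
```

Because each module's loss depends only on that module's inputs and outputs, such terms can be evaluated per module without propagating gradients across the stack.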

3. Iterative Latent Clustering Algorithm

The SLC framework is closely related to the Iterative Latent Clustering (ILC) algorithm, which proceeds level by level: it clusters the prediction vectors at the current level and uses the resulting cluster labels as the tokens of the next level.

This iterative approach is non-parametric and matches the neural SLC in sample complexity for recovering latent hierarchies.
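A minimal sketch of this level-by-level procedure follows, under loudly stated assumptions: the prediction vector of a patch is taken to be its empirical co-occurrence counts with sibling patches under the same parent, and a greedy cosine-threshold clustering stands in for k-means to keep the example dependency-free. All function names and the toy dataset are hypothetical.

```python
import math
from collections import Counter, defaultdict


def extract_patches(seq, s):
    return [tuple(seq[i:i + s]) for i in range(0, len(seq), s)]


def prediction_vectors(patched, s):
    """Count, for each distinct patch, the (offset, sibling patch) contexts it co-occurs with."""
    vecs = defaultdict(Counter)
    for patches in patched:
        for j, p in enumerate(patches):
            start = (j // s) * s              # first patch of the group sharing a parent
            for k in range(start, start + s):
                if k != j:
                    vecs[p][(k - j, patches[k])] += 1
    return vecs


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb)


def cluster(vecs, threshold=0.5):
    """Greedy stand-in for k-means: join the first cluster whose representative is similar enough."""
    reps, labels = [], {}
    for p in sorted(vecs):
        for idx, r in enumerate(reps):
            if cosine(vecs[p], vecs[r]) >= threshold:
                labels[p] = idx
                break
        else:
            reps.append(p)
            labels[p] = len(reps) - 1
    return labels


def ilc(data, s):
    """Coarsen all sequences level by level until a single patch remains."""
    levels = []
    while len(data[0]) > s:
        patched = [extract_patches(seq, s) for seq in data]
        labels = cluster(prediction_vectors(patched, s))
        data = [[labels[p] for p in patches] for patches in patched]
        levels.append(data)
    return levels


# Toy data (s = 2, L = 2): latent A emits (0,1) or (2,3); latent B emits (0,2) or (1,3).
A, B = [(0, 1), (2, 3)], [(0, 2), (1, 3)]
data = [list(a + b) for a in A for b in B] + [list(b + a) for a in A for b in B]
levels = ilc(data, s=2)
print(levels[0])  # [[0, 1], [0, 1], [0, 1], [0, 1], [1, 0], [1, 0], [1, 0], [1, 0]]
```

In this toy case the two synonymous patches of each latent share identical context counts, so they merge into one label, while patches of different latents have disjoint contexts and stay apart.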

4. Sample Complexity Analysis

For a balanced and separated RHM grammar, and assuming the stability of the clustering oracle, the number of samples required to recover all non-root latents scales as

$O(v\, m^3),$

where $m$ is the number of grammar rules per parent and $v$ is the vocabulary size. This is independent of depth $L$. The reasoning is:

At level 0, resolving token–token synonymy requires $O(v\, m^3)$ samples for concentration.
Once a layer is decoded, it induces a new RHM of the same form, so the required sample count does not increase with $L$.
A union bound over levels confirms the result.

By contrast, masked-token SSL on the RHM requires a number of samples that grows exponentially with depth $L$ to recover the deepest latents, highlighting the exponential gain conferred by stacked latent prediction (Korchinski et al., 26 May 2026).
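The depth independence of the stated scale can be made concrete with a small calculation. The sketch below evaluates $v m^3$ with constants omitted; the helper names and the example values of $v$, $m$, and the sample count are hypothetical and chosen only for illustration.

```python
def slc_sample_scale(v, m):
    """Scale v * m**3 of the stated O(v m^3) bound (constant factors omitted)."""
    return v * m ** 3


def rescaled_samples(num_samples, v, m):
    """Sample count expressed in units of v * m**3."""
    return num_samples / slc_sample_scale(v, m)


for depth in (2, 4, 8):
    # The scale has no dependence on depth, so every line prints 1024.
    print(depth, slc_sample_scale(v=16, m=4))

print(rescaled_samples(2048, v=16, m=4))  # 2.0
```

Expressing sample counts in units of $v m^3$ is the same rescaling under which curves from different grammars can be compared on a common axis.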

5. Neural End-to-End Implementation

An end-to-end SLC network instantiates each predictor and clusterer as differentiable modules:

Predictor ("Pred"): Three 1D convolutional layers (a strided first layer followed by stride-1 layers, with ReLU and batch norm), culminating in a softmax over the output classes for each patch position. The output is the prediction tensor described in Section 1.
Clusterer ("Clust"): Two 1D conv layers mapping the flattened prediction tensor to $d_h$-dimensional softmax outputs.
Teacher–Student EMA: Both Pred and Clust maintain exponential moving average teacher copies with update

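$\theta_{\text{teacher}} \leftarrow \tau\, \theta_{\text{teacher}} + (1 - \tau)\, \theta_{\text{student}}, \qquad \tau \in [0, 1)\ \text{(decay rate)},$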

enabling the student to predict the teacher’s outputs and avoid representational collapse.
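The exponential moving average update for the teacher copies can be sketched in a few lines. Parameters are represented here as a plain dictionary of floats, and the name `ema_update` and the decay value `tau=0.75` are assumptions made for the example, not values from the source.

```python
def ema_update(teacher, student, tau=0.99):
    """Return teacher parameters after one EMA step toward the student.

    tau is the decay rate: values near 1 make the teacher change slowly.
    """
    return {
        name: tau * teacher[name] + (1.0 - tau) * student[name]
        for name in teacher
    }


teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}
teacher = ema_update(teacher, student, tau=0.75)
print(teacher)  # {'w': 0.25, 'b': 1.0}
```

Only the student receives gradient updates; the teacher is a slowly moving average of past students, which gives the student a stable prediction target.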

Optimization uses AdamW with weight decay and a batch size of 32. The remaining hyperparameters are the learning rate, the number of hidden channels, the cluster codebook size $d_h$, the contrastive margin, and the separation and sparsity weights.

6. Empirical Demonstration

Empirical studies were conducted on RHM instances with various branching factors ($s$), vocabulary sizes ($v$), and rule counts ($m$):

ILC (k-means): Recovery accuracy curves for all non-root latents, plotted against the number of training samples, collapse when the sample count is rescaled by $v m^3$, with a sharp transition at a common rescaled value for all levels.
End-to-End SLC: The SLC stack is pretrained with local losses. A linear probe on the final-layer representation accurately predicts the top-level latent. After rescaling the sample count by $v m^3$, probe accuracy curves for the various settings collapse to a common threshold, saturating the $O(v m^3)$ bound.
Depth Independence: Experiments across depths $L$ reveal negligible shift in the threshold, confirming that the sample complexity remains $O(v m^3)$ as depth increases.
Comparison to Token-level SSL: Token-prediction objectives require a number of samples that grows exponentially with depth to reconstruct deep latents, exhibiting no scaling collapse under $v m^3$ normalization, unlike the SLC and ILC curves, which align under the $v m^3$ rescaling.

Collectively, these findings establish that SLC and its iterative analog recover deep hierarchical latents with sharply reduced and $L$-independent sample complexity.

7. Context and Implications

SLC exemplifies a hierarchical self-supervised learning paradigm centered on latent prediction rather than label or token prediction. The theoretical and empirical results indicate that explicit stacking of local predictor-clusterer modules is sufficient to saturate the $O(v m^3)$ sample complexity bound, obviating the need for deeper or more global prediction strategies. This sample efficiency contrasts starkly with the exponential sample requirements of token-level prediction schemes, a distinction with implications for the study of data-efficient learning in both artificial and biological systems (Korchinski et al., 26 May 2026).

A plausible implication is that latent-prediction protocols underpinning recent models such as data2vec and JEPA inherently induce hierarchical latent clustering, and that further explicit stacking, as in H-JEPA, may offer limited additional benefit beyond what is achievable by local SLC-style modules. The SLC architecture offers a principled template for scalable, modular, and data-efficient inference in structured generative domains.

References (1)

Learn from your own latents and not from tokens: A sample-complexity theory (2026)
