
Stacked Latent Clustering Network

Updated 31 May 2026
  • The Stacked Latent Clustering (SLC) network is a hierarchical self-supervised framework that iteratively predicts and clusters latent representations using modular predictor-clusterer units.
  • The architecture coarsens representations through repeated patch extraction and local prediction-clustering modules, which makes it exponentially more sample-efficient in depth than token-level methods.
  • Empirical and theoretical analyses confirm that SLC maintains an O(v·m³) sample complexity independent of depth, promoting scalable and data-efficient learning in structured domains.

Stacked Latent Clustering (SLC) networks implement a hierarchy of self-supervised modules that learn latent representations by predicting their own clustered abstractions across multiple levels of compositional structure. The approach, validated both algorithmically and via end-to-end neural instantiations, demonstrates a sample complexity that is exponentially more efficient in model depth compared to token-level prediction, providing theoretical and empirical advantages for learning in structured domains such as the Random Hierarchy Model (RHM) (Korchinski et al., 26 May 2026).

1. Architectural Composition

An SLC network ingests a string of visible tokens $x = (x_1, \ldots, x_{s^L}) \in \mathcal V_0^{s^L}$, where $L$ is the hierarchical depth, $s$ is the branching factor, and $\mathcal V_0$ is a vocabulary of size $v$. The core of the architecture is a stack of $(L-1)$ identical “predictor + clusterer” modules. At hierarchical level $\ell$, each module processes a discrete token sequence

$$\widehat h^{(\ell)} = (\widehat h^{(\ell)}_1, \ldots, \widehat h^{(\ell)}_{s^{L-\ell}}) \in \{1, \ldots, d_h\}^{s^{L-\ell}},$$

where $d_h$ is the cluster codebook size. The module workflow is:

  • Patch Extraction: The level-$\ell$ tokens are grouped into non-overlapping $s$-tuples (patches),

LL1

  • Prediction ("Pred"): Each patch is embedded into one-hot representations and processed by a small CNN+softmax, yielding prediction tensors:

LL2

indexing local position, cousin-patch, within-patch token, and output classes.

  • Clustering ("Clust"): The prediction tensor is vectorized and processed by another CNN+softmax, resulting in a soft label assignment:

LL3

from which the hard discrete label is extracted via $\arg\max$ or Gumbel-softmax.

The outputs $\widehat h^{(\ell+1)}$ become the input tokens for the next module, iteratively coarsening the representation.
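The two discrete steps of a module, grouping tokens into patches and turning soft assignments into hard labels, can be sketched in a few lines. This is a minimal illustration, not the source's implementation: the helper names `extract_patches` and `hard_labels` are hypothetical, the soft assignments are made-up numbers standing in for the clusterer output, and labels are 0-indexed here whereas the text uses $\{1, \ldots, d_h\}$.

```python
from typing import List, Sequence, Tuple


def extract_patches(tokens: Sequence[int], s: int) -> List[Tuple[int, ...]]:
    """Group a token sequence into non-overlapping s-tuples (patches)."""
    if len(tokens) % s != 0:
        raise ValueError("sequence length must be a multiple of s")
    return [tuple(tokens[i:i + s]) for i in range(0, len(tokens), s)]


def hard_labels(soft: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable cluster for each patch (row-wise argmax)."""
    return [max(range(len(row)), key=row.__getitem__) for row in soft]


# Eight level-l tokens with branching factor s = 2 give four patches.
tokens = [3, 1, 2, 2, 0, 1, 3, 3]
patches = extract_patches(tokens, s=2)

# Made-up soft assignments (one row per patch, d_h = 3 columns).
soft = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.2, 0.6, 0.2], [0.5, 0.3, 0.2]]
next_tokens = hard_labels(soft)  # input sequence for the next module
```

Each pass shortens the sequence by a factor of $s$, which is the coarsening described above. The Gumbel-softmax alternative mentioned in the text is not shown.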

2. Learning Objectives and Local Losses

Each SLC module is trained independently with two distinct local objectives:

  • Prediction loss LL6 employs token masking within each cousin patch, optimizing cross-entropy to reconstruct masked entries. For every sample, patch, and cousin position, tokens are masked and the predictor attempts to infer the correct one-hot encoding, with the formal objective:

LL7

with LL8, LL9, and ss0 as specified indices.

  • Clustering loss ss1 drives the collapse of synonymous vectors and separation of unrelated ones. Formally,

ss2

where ss3 is the cosine similarity, ss4 a margin, and the ss5's are hyperparameters for separation and sparsity.

The sum ss6 forms the total local loss. By default, gradients are not propagated between modules, enabling module-wise locality reminiscent of biological computation.
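To make the roles of cosine similarity and margin concrete, here is a sketch of a generic cosine-embedding pair term: pairs meant to share a cluster are pulled toward similarity 1, and other pairs are penalized only when their similarity exceeds the margin. This is a standard textbook form swapped in for illustration, not the clustering loss defined above. The function names, the `same_cluster` flag, and the margin value are assumptions, and the separation and sparsity terms are omitted.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def pair_term(u: Sequence[float], v: Sequence[float],
              same_cluster: bool, margin: float) -> float:
    """Generic cosine-embedding term (illustrative, not the source's loss)."""
    c = cosine(u, v)
    if same_cluster:
        return 1.0 - c              # pull synonymous vectors together
    return max(0.0, c - margin)     # push apart only when too similar


aligned = pair_term([1.0, 0.0], [1.0, 0.0], same_cluster=True, margin=0.2)
orthogonal = pair_term([1.0, 0.0], [0.0, 1.0], same_cluster=False, margin=0.2)
collided = pair_term([1.0, 0.0], [1.0, 0.0], same_cluster=False, margin=0.2)
```

Identical vectors in the same cluster and orthogonal vectors in different clusters both contribute nothing; only the third case, where unrelated vectors coincide, is penalized.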

3. Iterative Latent Clustering Algorithm

The SLC framework is closely related to the Iterative Latent Clustering (ILC) algorithm, which hierarchically clusters prediction vectors at each level. Sample pseudocode for ILC appears as:

\ell8

This iterative approach is non-parametric and matches the neural SLC in sample complexity for recovering latent hierarchies.
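The loop structure of such an iterative scheme can be sketched as follows. This is an illustration under stated assumptions, not the source's pseudocode: the "prediction vector" of a patch is approximated by the empirical distribution of the sibling patches it co-occurs with, a greedy L1-threshold rule stands in for k-means, and every name (`ilc`, `cluster_patches`, `tol`, the toy grammar) is hypothetical. Sequence lengths are assumed to be powers of $s$.

```python
from collections import Counter, defaultdict
from itertools import product
from typing import Dict, List, Sequence, Tuple

Patch = Tuple[int, ...]


def to_patches(seq: Sequence[int], s: int) -> List[Patch]:
    return [tuple(seq[i:i + s]) for i in range(0, len(seq), s)]


def context_distributions(data, s: int) -> Dict[Patch, Dict[Patch, float]]:
    """Empirical distribution of sibling patches for each observed patch."""
    counts: Dict[Patch, Counter] = defaultdict(Counter)
    for seq in data:
        patches = to_patches(seq, s)
        for start in range(0, len(patches), s):
            group = patches[start:start + s]
            for i, p in enumerate(group):
                for j, q in enumerate(group):
                    if i != j:
                        counts[p][q] += 1
    return {p: {q: n / sum(c.values()) for q, n in c.items()}
            for p, c in counts.items()}


def l1(a: Dict[Patch, float], b: Dict[Patch, float]) -> float:
    return sum(abs(a.get(k, 0.0) - b.get(k, 0.0)) for k in set(a) | set(b))


def cluster_patches(dists, tol: float) -> Dict[Patch, int]:
    """Greedy clustering: join the first cluster within tol, else open one."""
    reps: List[Dict[Patch, float]] = []
    labels: Dict[Patch, int] = {}
    for p in sorted(dists):
        for k, rep in enumerate(reps):
            if l1(dists[p], rep) < tol:
                labels[p] = k
                break
        else:
            labels[p] = len(reps)
            reps.append(dists[p])
    return labels


def ilc(data, s: int, tol: float = 0.5):
    """Cluster and relabel level by level until only s tokens remain."""
    levels = []
    while len(data[0]) > s:
        labels = cluster_patches(context_distributions(data, s), tol)
        data = [[labels[p] for p in to_patches(seq, s)] for seq in data]
        levels.append(labels)
    return levels, data


# Toy grammar: latent A has synonyms (0,1),(2,3); B has (1,0),(3,2).
rules = {"A": [(0, 1), (2, 3)], "B": [(1, 0), (3, 2)]}
data = [list(left + right)
        for a, b in [("A", "B"), ("B", "A")]
        for left, right in product(rules[a], rules[b])]
levels, top = ilc(data, s=2)
```

In the toy grammar the two synonymous productions of each latent share the same sibling statistics, so they land in the same cluster, and the relabeled sequences expose the level-1 latents directly.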

4. Sample Complexity Analysis

For a balanced and separated RHM grammar, and assuming the stability of the clustering oracle, the required number of samples to recover all non-root latents is

$$O(v\,m^3),$$

where $m$ is the number of grammar rules per parent and $v$ is the vocabulary size. This is independent of depth $L$. The reasoning is:

  • At level 0, resolving token–token synonymy requires V0\mathcal V_02 for concentration.
  • Once a layer is decoded, it induces a new RHM of the same form, so V0\mathcal V_03 does not increase with V0\mathcal V_04.
  • A union bound over levels confirms the result.

By contrast, masked-token SSL on the RHM requires V0\mathcal V_05 samples to recover the deepest latents, highlighting the exponential gain conferred by stacked latent prediction (Korchinski et al., 26 May 2026).
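As a quick arithmetic illustration of the $O(v \cdot m^3)$ scaling stated for SLC, the snippet below computes the scale $v\,m^3$ and uses it to rescale a sample count. The function names and the values $v = 32$, $m = 8$ are made up for the example, and the constant hidden in the $O(\cdot)$ is ignored, so the numbers indicate scale only.

```python
def slc_sample_scale(v: int, m: int) -> int:
    """Scale v * m**3 of the SLC sample complexity (constants ignored)."""
    return v * m ** 3


def rescaled(num_samples: int, v: int, m: int) -> float:
    """Sample count expressed in units of v * m**3."""
    return num_samples / slc_sample_scale(v, m)


scale = slc_sample_scale(v=32, m=8)      # 32 * 512
ratio = rescaled(32768, v=32, m=8)       # twice the scale

# Depth does not enter the formula, so the scale is the same for every L.
scales_by_depth = {L: slc_sample_scale(32, 8) for L in (2, 3, 4)}
```

The depth $L$ never appears in the computation, which is the point of the depth-independence claim.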

5. Neural End-to-End Implementation

An end-to-end SLC network instantiates each predictor and clusterer as differentiable modules:

  • Predictor ("Pred"): Three 1D convolutional layers (stride V0\mathcal V_06 then 1, with ReLU and batch norm), culminating in a softmax over V0\mathcal V_07 classes for each patch position. Output shape: V0\mathcal V_08.
  • Clusterer ("Clust"): Two 1D conv layers mapping the flattened prediction tensor to $d_h$-dimensional softmax outputs.
  • Teacher–Student EMA: Both Pred and Clust maintain exponential moving average teacher copies with update

vv0

enabling the student to predict the teacher’s outputs and avoid representational collapse.
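A minimal sketch of the conventional exponential-moving-average rule, teacher ← decay · teacher + (1 − decay) · student, applied to a flat list of parameters. The name `decay` and the value 0.9 are assumptions for illustration; the exact coefficient and parameterization used in the source are not reproduced here.

```python
from typing import List, Sequence


def ema_update(teacher: Sequence[float], student: Sequence[float],
               decay: float) -> List[float]:
    """Return new teacher parameters after one conventional EMA step."""
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    if len(teacher) != len(student):
        raise ValueError("teacher and student must have the same size")
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


teacher = [0.0, 1.0]
student = [1.0, 1.0]
teacher = ema_update(teacher, student, decay=0.9)  # moves 10% toward student
```

Because the teacher only drifts slowly toward the student, its outputs change smoothly, which is the usual reason a slowly updated target helps against collapse.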

Typical optimization parameters: learning rate vv1 (AdamW), weight decay vv2, batch size 32, hidden channels vv3, cluster codebook vv4, contrastive margin vv5, separation vv6, sparsity vv7.

6. Empirical Demonstration

Empirical studies were conducted on RHM instances with various branching factors (vv8), vocabulary sizes (vv9), and rule counts ((L1)(L-1)0):

  • ILC (k-means): For (L1)(L-1)1, recovery accuracy curves of all non-root latents as (L1)(L-1)2 increases collapse when rescaled by (L1)(L-1)3, sharpening at (L1)(L-1)4 for all levels.
  • End-to-End SLC: For (L1)(L-1)5, the SLC stack is pretrained with local losses. A linear probe on the final-layer representation accurately predicts the top-level latent (L1)(L-1)6. After rescaling (L1)(L-1)7, probe accuracy curves for various (L1)(L-1)8 collapse to a common threshold, saturating the (L1)(L-1)9 bound.
  • Depth Independence: Experiments across \ell0 reveal negligible shift in the threshold \ell1, confirming that the sample complexity remains \ell2 as depth increases.
  • Comparison to Token-level SSL: Token-prediction objectives require \ell3 samples to reconstruct deep latents, exhibiting no scaling collapse under \ell4 normalization, unlike the SLC and ILC curves which align perfectly under the \ell5 rescaling.

Collectively, these findings establish that SLC and its iterative analog recover deep hierarchical latents with sharply reduced, depth-independent sample complexity.

7. Context and Implications

SLC exemplifies a hierarchical self-supervised learning paradigm centered on latent prediction rather than label or token prediction. The theoretical and empirical results indicate that explicit stacking of local predictor-clusterer modules is sufficient to saturate the $O(v\,m^3)$ sample complexity bound, obviating the need for deeper or more global prediction strategies. This sample efficiency contrasts starkly with the exponential sample requirements of token-level prediction schemes, a distinction with implications for the study of data-efficient learning in both artificial and biological systems (Korchinski et al., 26 May 2026).

A plausible implication is that latent-prediction protocols underpinning recent models such as data2vec and JEPA inherently induce hierarchical latent clustering, and that further explicit stacking, as in H-JEPA, may offer limited additional benefit beyond what is achievable by local SLC-style modules. The SLC architecture offers a principled template for scalable, modular, and data-efficient inference in structured generative domains.
