- The paper demonstrates that progressive freezing mitigates representation collapse while reducing computational overhead and memory usage.
- The methodology employs dynamic target evolution, initially predicting pixels and later shifting to deeper latent representations.
- Empirical results show up to a 19% reduction in training FLOPs together with a 2–3% Top-1 accuracy gain on action recognition benchmarks.
LayerLock: Non-collapsing Representation Learning with Progressive Freezing
Introduction and Motivation
LayerLock introduces a progressive freezing paradigm for self-supervised visual representation learning, specifically targeting video masked auto-encoding (MAE) models with Vision Transformer (ViT) backbones. The central observation motivating LayerLock is that ViT layers converge in order of their depth during training: shallow layers stabilize early, while deeper layers require more iterations. This ordered convergence enables a training strategy where layers are progressively frozen as they converge, reducing computational overhead and memory usage without sacrificing performance.
Figure 1: In video masked auto-encoding, network layers converge during training in order of their depth, as measured by final loss after freezing up to layer L at step T.
LayerLock further extends this idea by dynamically evolving the prediction target throughout training. Initially, the model predicts low-level features (pixels or early layer activations), and as layers are frozen, the prediction target transitions to deeper latent representations. This approach combines the stability of pixel prediction with the semantic richness of latent prediction, while avoiding representation collapse—a common failure mode in latent-based self-supervised learning.
Methodology
LayerLock is implemented within the standard MAE framework, where a masked video clip is encoded and the model reconstructs the masked regions. The key innovations are a progressive freezing schedule and dynamic target selection.
The architecture leverages a ViT backbone, patchifying input videos and applying novel 3D rotary positional embeddings (RoPE) to encode spatiotemporal position. Decoding latents are concatenated before the final Transformer blocks, and a patch-wise linear decoder projects tokens to the target space (pixels or latent activations).
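Below is a minimal sketch of what a factorized 3D RoPE could look like, assuming the feature dimension is split into three equal groups rotated by temporal, height, and width positions; the function names, the even split, and the PyTorch framing are illustrative rather than the paper's exact formulation (per the implementation notes later in this piece, the rotation is applied to token features after the first normalization layer of each block rather than to attention queries and keys).

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding over the last dimension of x.

    x:   (..., n_tokens, dim) with dim even
    pos: (n_tokens,) positions along one axis, on the same device as x
    """
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device, dtype=torch.float32) / half))
    angles = pos[:, None].float() * freqs[None, :]            # (n_tokens, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(x: torch.Tensor, t: torch.Tensor, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Factorized 3D RoPE: rotate three equal slices of the feature dimension
    by each token's (time, height, width) position. Assumes dim % 6 == 0."""
    assert x.shape[-1] % 6 == 0
    d = x.shape[-1] // 3
    return torch.cat(
        [rope_1d(x[..., :d], t), rope_1d(x[..., d:2 * d], h), rope_1d(x[..., 2 * d:], w)],
        dim=-1,
    )
```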
A simple L2 loss is used between the predicted and target representations. For efficiency, latent loss can be computed on a subset of patches, with ablations showing that even 5% patch selection maintains strong performance.
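As a concrete illustration of the patch-subset loss, here is a minimal sketch assuming predictions and targets are laid out as (batch, patches, dim) tensors; the helper name and the subset shared across the batch are assumptions, not the paper's exact recipe.

```python
import torch

def subset_latent_loss(pred: torch.Tensor,
                       target: torch.Tensor,
                       keep_frac: float = 0.05) -> torch.Tensor:
    """L2 loss between predicted and target representations, computed on a
    random fraction of patch tokens for efficiency.

    pred, target: (batch, num_patches, dim); the target comes from a frozen
    layer, so it is detached as a safeguard against gradient flow.
    """
    n = pred.shape[1]
    k = max(1, int(keep_frac * n))
    idx = torch.randperm(n, device=pred.device)[:k]   # same patch subset for the whole batch
    return torch.mean((pred[:, idx] - target[:, idx].detach()) ** 2)
```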
LayerLock achieves substantial efficiency gains over vanilla MAE by reducing the number of layers involved in the backward pass as training progresses; frozen layers require no gradient computation, activation storage, or optimizer updates.
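A minimal sketch of why freezing shortens the backward pass, assuming the encoder is a plain list of Transformer blocks; `blocks`, `decode`, and `n_frozen` are illustrative names, and exactly how the latent target is constructed follows the paper rather than this sketch.

```python
import torch

def forward_with_frozen_prefix(blocks, decode, tokens, n_frozen: int):
    """Run a ViT-style stack with the first `n_frozen` blocks frozen.

    Only the trainable suffix builds an autograd graph, so backward-pass
    compute and activation memory shrink as more layers are frozen.
    """
    h = tokens
    with torch.no_grad():                  # frozen prefix: no graph, no stored activations
        for blk in blocks[:n_frozen]:
            h = blk(h)
    frozen_out = h                         # candidate latent target (last frozen layer)

    for blk in blocks[n_frozen:]:          # trainable suffix: normal autograd
        h = blk(h)
    return decode(h), frozen_out
```

Early in training the decoded output is compared against pixels; once layers start to freeze, it is compared against `frozen_out` (or a subset of its patches, as above).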
Empirical Results
LayerLock was evaluated on large-scale video datasets (174M web videos, 1B training examples) and compared against strong pixel (4DS MAE) and latent (V-JEPA) prediction baselines. Key findings include:
- Action Recognition (SSv2, Kinetics700): LayerLock models consistently outperform baselines, with improvements of 2–3% in Top-1 accuracy.
- Depth Estimation (ScanNet): LayerLock matches or slightly improves relative depth error compared to baselines.
- Generalization: LayerLock is effective for both pixel and latent prediction paradigms, demonstrating its versatility.
Ablation studies confirm the necessity of progressive freezing to avoid representation collapse when introducing latent losses. The use of 3D RoPE embeddings further boosts performance, and single-target prediction (latest frozen layer) is sufficient, simplifying implementation.
Implementation Considerations
- Freezing Schedule: Optimal schedules involve gradual freezing (e.g., freezing one additional layer every 10K steps after an initial pixel-prediction phase), with earlier layers frozen first; a schedule sketch appears after this list.
- Latent Loss Patch Selection: Computing latent loss on a subset of patches (e.g., 5%) offers a trade-off between efficiency and performance, with minimal degradation for semantic tasks.
- Positional Embeddings: 3D RoPE applied after the first normalization layer in ViT blocks yields superior results compared to standard attention-based RoPE.
- Warmup Strategy: Mini learning rate warmups when switching targets improve stability and final accuracy.
- Scaling: LayerLock is demonstrated on models up to 4B parameters and is compatible with large-scale distributed training (256 TPUs).
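Putting the schedule and warmup notes together, here is a minimal sketch of a step-indexed freezing schedule with a mini learning-rate warmup after each target switch; the pixel-phase length, warmup length, and 12-block cap are assumptions, and only the one-layer-per-10K-steps cadence comes from the notes above.

```python
from typing import Optional

def n_frozen_at(step: int,
                pixel_phase_steps: int = 50_000,
                freeze_every: int = 10_000,
                max_frozen: int = 12) -> int:
    """Number of leading encoder blocks frozen at a given training step.

    Nothing is frozen during the initial pixel-prediction phase; afterwards one
    additional layer is frozen every `freeze_every` steps, up to `max_frozen`.
    """
    if step < pixel_phase_steps:
        return 0
    return min(max_frozen, 1 + (step - pixel_phase_steps) // freeze_every)

def lr_at(step: int, base_lr: float,
          last_switch_step: Optional[int], mini_warmup: int = 1_000) -> float:
    """Linear mini warmup of the learning rate after the most recent target switch."""
    if last_switch_step is None or step - last_switch_step >= mini_warmup:
        return base_lr
    return base_lr * (step - last_switch_step + 1) / mini_warmup
```

A training loop would recompute `n_frozen_at(step)` every step, record the step at which it last increased as `last_switch_step`, and feed the result into the frozen-prefix forward pass sketched in the methodology section.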
Theoretical and Practical Implications
LayerLock provides a principled approach to non-collapsing representation learning by leveraging the natural convergence order of deep networks. The method bridges pixel and latent prediction, enabling stable training of large models without the need for architectural asymmetry or auxiliary target encoders. The efficiency gains in compute and memory are particularly relevant for scaling to longer video sequences, higher resolutions, and deeper architectures.
The progressive freezing paradigm may have broader applicability in other domains where layerwise convergence is observed, such as NLP or multimodal transformers. The dynamic target evolution strategy could inform future work on curriculum learning and adaptive self-supervision.
Conclusion
LayerLock introduces a progressive freezing and dynamic target prediction strategy for self-supervised video representation learning, yielding efficient, stable, and high-performing models. The approach is validated across multiple tasks and model scales, with extensive ablations confirming its robustness and generality. Future research may explore more granular freezing schedules, application to other modalities, and integration with advanced readout architectures for further gains in representation quality and efficiency.