
SimCLR-based JEPA: Predictive World Modeling

Updated 22 December 2025
  • The paper introduces SimCLR-JEPA, a reconstruction-free framework that aligns predicted and observed latent representations using a temporal InfoNCE loss.
  • It employs a dual-module design with a convolutional encoder and a GRU-based predictor, omitting the SimCLR projection head so the instance-wise contrastive loss acts directly on the latents.
  • Empirical evaluations show sub-pixel probing accuracy under changing distractors, but representational collapse under static, persistent distractors.

SimCLR-based Joint Embedding Predictive Architecture (SimCLR-JEPA) is a reconstruction-free framework for self-supervised world model learning that leverages a temporal contrastive objective to align predicted and observed latent representations. In contrast to generative models trained with pixel-level reconstruction loss, SimCLR-JEPA avoids decoding and directly optimizes an instance-wise contrastive loss across temporally adjacent embeddings. Its primary distinguishing feature is its reliance on slow feature learning, which both enables robust performance under certain types of background variation and leads to a notable failure mode in the presence of temporally persistent distractors (Sobal et al., 2022).

1. Architectural Design

SimCLR-JEPA employs a dual-module structure:

  • Encoder $g_\phi$: Implements three convolutional layers (channels 32→64→64, each with BatchNorm and ReLU activations), followed by $2\times2$ average pooling and a linear projection to a $d=512$-dimensional latent space. This module converts each input image $o_t \in \mathbb{R}^{H\times W}$ into an instantaneous latent state $s_t = g_\phi(o_t) \in \mathbb{R}^d$.
  • Predictor (Forward Model) $f_\theta$: Consists of a one-layer GRU with hidden size 512. At each time step, it consumes the previous latent and the action $a_{t-1} \in \mathbb{R}^2$, generating a one-step latent prediction $\tilde{s}_t = f_\theta(\tilde{s}_{t-1}, a_{t-1})$. The prediction sequence initializes with $\tilde{s}_1 = s_1$. Notably, the architecture omits the standard SimCLR MLP projection head; $\tilde{s}_t$ serves directly as the second view in the contrastive framework.
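The two modules can be sketched in PyTorch as follows. Kernel sizes, strides, and the input resolution (28×28 here) are illustrative assumptions not specified above; only the channel widths (32→64→64), the $2\times2$ average pooling, the $d=512$ latent size, and the one-layer GRU with 2-D action input are taken from the description.

```python
# Minimal sketch of the SimCLR-JEPA dual-module design: conv encoder plus
# GRU predictor. Hyperparameters beyond those stated in the text are guesses.
import torch
import torch.nn as nn

D_LATENT = 512

class Encoder(nn.Module):
    def __init__(self, in_ch=1, d=D_LATENT):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1),    nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1),    nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d(2),                    # 2x2 average pooling
        )
        self.proj = nn.LazyLinear(d)            # linear projection to latent space

    def forward(self, o):                       # o: (B, C, H, W)
        return self.proj(self.conv(o).flatten(1))  # s_t: (B, d)

class Predictor(nn.Module):
    """One-layer GRU cell: s~_t = f_theta(s~_{t-1}, a_{t-1})."""
    def __init__(self, d=D_LATENT, action_dim=2):
        super().__init__()
        self.cell = nn.GRUCell(action_dim, d)

    def forward(self, s_prev, a_prev):
        # action is the GRU input; the previous latent is the hidden state
        return self.cell(a_prev, s_prev)

enc, pred = Encoder(), Predictor()
o = torch.randn(4, 1, 28, 28)                   # batch of 4 observations
s1 = enc(o)                                     # s_1, shape (4, 512)
s2_tilde = pred(s1, torch.randn(4, 2))          # one-step prediction s~_2
```

Note there is no projection head: `s2_tilde` is contrasted against encoder outputs directly.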

2. Loss Function: SimCLR InfoNCE in JEPA

SimCLR-JEPA adapts the SimCLR InfoNCE loss to a predictive setting. At each time step $t = 2, \dots, T+1$, two views are produced:

  • $s_t = g_\phi(o_t)$ (“encoder view”)
  • $\tilde{s}_t = f_\theta(s_{t-1}, a_{t-1})$ (“predictor view”)

The instance-wise symmetric InfoNCE loss, applied with temperature $\tau > 0$, is built from the exponentiated cosine similarity

$$\mathrm{expsim}(u, v) = \exp\left( \frac{u^\top v}{\tau \|u\| \|v\|} \right)$$

$$\mathrm{InfoNCE}(S_t, \tilde S_t) = -\frac{1}{N} \sum_{i=1}^N \log \frac{\mathrm{expsim}(s_{t,i}, \tilde s_{t,i})}{\sum_{k=1}^N \left[ \mathrm{expsim}(s_{t,i}, \tilde s_{t,k}) + \mathbf{1}_{k \neq i}\, \mathrm{expsim}(s_{t,i}, s_{t,k}) \right]}$$

The total symmetric objective, averaged over $T$ time steps, is:

$$\mathcal{L}_\mathrm{SimCLR} = \frac{1}{2T} \sum_{t=2}^{T+1} \left[ \mathrm{InfoNCE}(S_t, \tilde S_t) + \mathrm{InfoNCE}(\tilde S_t, S_t) \right]$$

Positive pairs consist of the time-aligned encoder embedding and its one-step predicted latent; negatives are all other embeddings from the same batch at the same time step.
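The per-time-step loss can be sketched in NumPy as follows. The positive is the time-aligned $(s_{t,i}, \tilde s_{t,i})$ pair, and the denominator includes both predictor-view and other encoder-view embeddings, matching the indicator $\mathbf{1}_{k \neq i}$ above. The temperature $\tau = 0.1$ is an illustrative choice, not a value from the paper.

```python
# NumPy sketch of the symmetric temporal InfoNCE loss at one time step.
import numpy as np

def expsim(U, V, tau=0.1):
    """exp of temperature-scaled cosine similarity between rows of U and V."""
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    return np.exp(U @ V.T / tau)              # (N, N) exp-similarity matrix

def info_nce(S, S_tilde, tau=0.1):
    sim_pred = expsim(S, S_tilde, tau)        # s_{t,i} vs s~_{t,k}, all k
    sim_enc = expsim(S, S, tau)               # s_{t,i} vs s_{t,k}
    np.fill_diagonal(sim_enc, 0.0)            # indicator 1_{k != i}
    denom = sim_pred.sum(axis=1) + sim_enc.sum(axis=1)
    return -np.mean(np.log(np.diag(sim_pred) / denom))

def symmetric_loss(S, S_tilde, tau=0.1):
    return 0.5 * (info_nce(S, S_tilde, tau) + info_nce(S_tilde, S, tau))

rng = np.random.default_rng(0)
S = rng.normal(size=(8, 512))                 # batch of N = 8 encoder views
loss_aligned = symmetric_loss(S, S + 0.01 * rng.normal(size=S.shape))
loss_random = symmetric_loss(S, rng.normal(size=(8, 512)))
```

As expected, near-perfect one-step predictions (`loss_aligned`) score far lower than unrelated predictions (`loss_random`).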

3. Training Protocol and Environment

Training operates fully offline and without reward information. The setup incorporates:

  • Dataset: One million pre-collected trajectories of the form $\{o_1, a_1, o_2, \dots, a_T, o_{T+1}\}$, used exclusively for pretraining.
  • Simulated Environment: Each episode is a 17-frame sequence in which a single Gaussian-blurred dot traverses the $[0,1]^2$ square. Actions are sampled as $a_t \sim \mathrm{Uniform}[0, D] \times \mathrm{VonMises}(\mu, \kappa)$ with $D = 0.14$.
  • Distractor Generation: Backgrounds consist of either “uniform” (pixel values $\sim \mathrm{Uniform}[0,1]$) or “structured” (random CIFAR-10 images) noise. Distractors can be:
    • “Changing”: independently resampled each frame,
    • “Fixed”: held constant within an episode, resampled between episodes.
  • A brightness coefficient $\alpha \in [0, 1.5]$ regulates distractor visibility.
  • No classical data augmentation is applied; all nondeterministic variation arises from distractor generation.
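The data-generating process above can be sketched as follows (uniform-noise distractors only). The image size, dot width, and von Mises parameters are illustrative assumptions; the 17-frame episode length, the $[0,1]^2$ domain, $D = 0.14$, the fixed/changing distinction, and the brightness coefficient $\alpha$ come from the setup described.

```python
# NumPy sketch of the simulated environment: a Gaussian dot on [0,1]^2 with
# polar-step actions, composited with a fixed or changing noise distractor.
import numpy as np

def render_dot(c, size=64, sigma=2.0):
    """Render a Gaussian-blurred dot centred at c in [0,1]^2 (sigma in pixels)."""
    ys, xs = np.mgrid[0:size, 0:size] / size
    sq_dist = ((xs - c[0])**2 + (ys - c[1])**2) * size**2
    return np.exp(-sq_dist / (2 * sigma**2))

def episode(T=16, D=0.14, alpha=1.0, fixed=True, size=64, seed=0):
    rng = np.random.default_rng(seed)
    c = rng.uniform(0.2, 0.8, size=2)           # initial dot position
    bg = rng.uniform(0, 1, size=(size, size))   # episode-wise distractor
    frames, positions = [], []
    for _ in range(T + 1):                      # T + 1 = 17 frames
        if not fixed:                           # "changing": resample per frame
            bg = rng.uniform(0, 1, size=(size, size))
        frames.append(np.clip(render_dot(c, size) + alpha * bg, 0, 1))
        positions.append(c.copy())
        r = rng.uniform(0, D)                   # action: Uniform[0, D] step length
        theta = rng.vonmises(0.0, 4.0)          # von Mises step direction
        c = np.clip(c + r * np.array([np.cos(theta), np.sin(theta)]), 0, 1)
    return np.stack(frames), np.stack(positions)

frames, positions = episode(fixed=True, alpha=1.0)
```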

4. Empirical Performance and Evaluation

Representational fidelity is assessed via a linear “prober” trained on the (frozen) latent sequence $\tilde{s}_t$ to regress the true dot position $c_t \in \mathbb{R}^2$, reporting RMSE averaged across the 17 frames.
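A minimal sketch of this probing protocol, using closed-form ridge regression: the latents here are synthetic stand-ins (position mixed into 512 dimensions plus noise); in the actual protocol they would be the frozen predictor outputs $\tilde{s}_t$, and the probe would be trained with held-out evaluation.

```python
# Linear probing sketch: ridge-regress true positions from frozen latents,
# report RMSE. Latents are synthetic stand-ins, not real model outputs.
import numpy as np

rng = np.random.default_rng(0)
true_pos = rng.uniform(0, 1, size=(1000, 2))        # c_t in [0,1]^2
W_mix = rng.normal(size=(2, 512))                   # hypothetical mixing map
latents = true_pos @ W_mix + 0.01 * rng.normal(size=(1000, 512))

# Closed-form ridge regression with a bias column.
X = np.hstack([latents, np.ones((1000, 1))])
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ true_pos)
pred = X @ W
rmse = np.sqrt(np.mean(np.sum((pred - true_pos)**2, axis=1)))
```

When the latents linearly encode position, the probe recovers it with small RMSE; a collapsed representation would leave RMSE near that of predicting the mean position.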

Performance under changing background noise:

  • SimCLR-JEPA and VICReg-JEPA achieve sub-pixel accuracy (RMSE $\approx 0.02$–$0.06$) as $\alpha$ increases up to $1.5$, without hyperparameter retuning.
  • Pixel-level reconstruction baselines degrade (RMSE $\approx 0.2$) at high $\alpha$ if not tuned.

Performance under fixed background noise:

  • SimCLR-JEPA and VICReg-JEPA experience representational collapse: RMSE increases to $\approx 0.17$–$0.28$ at $\alpha = 1.5$, comparable to random guessing.
  • In contrast, reconstruction-based models remain robust (RMSE $\approx 0.07$–$0.11$ for $\alpha \leq 1.5$).

| Condition | SimCLR-JEPA RMSE | Reconstruction RMSE |
|---|---|---|
| Fixed Uniform ($\alpha = 1.0$) | ≈0.19 | ≈0.07 |
| Changing Uniform ($\alpha = 1.0$) | ≈0.03 | ≈0.20 |

This empirical evidence shows SimCLR-JEPA is highly robust against unpredictable, temporally varying distractors but fails catastrophically when distractors are temporally persistent.

5. Theoretical Analysis: Failure Mode under Fixed Distractors

SimCLR-JEPA’s contrastive predictive objective selects for “slow features” (features that are constant or slowly varying in time). With a fixed episode-wise distractor $Z$, constant within an episode but independent between episodes, the architecture admits a trivial solution:

$$g_\phi(o_t) = s \sim \mathcal{N}(0, \sigma^2 I), \quad f_\theta(s, a) = s$$

With $s_t = s_{t+1} = s$ for all $t$, positive pairs align perfectly and the InfoNCE loss is minimized. The resulting embeddings are uniform on the hypersphere (in the large-negative-sample limit) and thus satisfy the contrastive uniformity condition (cf. Wang & Isola 2020, Thm. 1). The construction achieves $\mathcal{L}_\mathrm{SimCLR} \to 0$ without encoding any task-relevant fast feature (the moving dot). A similar collapse argument applies to VICReg, as the prediction, variance, and covariance losses also vanish under this solution. This outcome demonstrates that SimCLR-JEPA cannot distinguish static from dynamic content in the absence of additional constraints.
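The collapse argument can be checked numerically: if the encoder outputs one random constant latent per instance (ignoring the observations entirely) and the predictor is the identity, every positive pair has cosine similarity 1 while negatives between independent high-dimensional Gaussians are nearly orthogonal, so the InfoNCE loss is driven toward zero. The temperature $\tau = 0.1$ is an illustrative choice.

```python
# Numeric illustration of the trivial solution g_phi(o_t) = s, f_theta(s, a) = s.
import numpy as np

def info_nce(S, S_tilde, tau=0.1):
    norm = lambda M: M / np.linalg.norm(M, axis=1, keepdims=True)
    sim_pred = np.exp(norm(S) @ norm(S_tilde).T / tau)
    sim_enc = np.exp(norm(S) @ norm(S).T / tau)
    np.fill_diagonal(sim_enc, 0.0)            # indicator 1_{k != i}
    denom = sim_pred.sum(axis=1) + sim_enc.sum(axis=1)
    return -np.mean(np.log(np.diag(sim_pred) / denom))

rng = np.random.default_rng(0)
N, d = 8, 512
s = rng.normal(size=(N, d))   # one constant latent per instance, s ~ N(0, I)
# Trivial solution: s_t = s~_t = s at every time step, for every t.
loss = info_nce(s, s)
```

The loss is already near zero at $d = 512$, even though the latents carry no information about the dot.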

6. Practical Strengths, Weaknesses, and Research Implications

Strengths of SimCLR-JEPA:

  • Reconstruction-free: Circumvents blurry pixel-MSE losses and can ignore highly unpredictable (changing) distractors without retuning.
  • Simplicity: Employs only an encoder and a forward model—decoder is unnecessary.
  • Hyperparameter robustness: The temperature parameter τ\tau generalizes well across noise intensity, provided distractors are non-stationary.

Weaknesses:

  • Susceptibility to static slow features: When any static or slow-varying background is present, the learned representation collapses onto those spurious distractors, neglecting dynamic, task-relevant features.
  • Blindness to source dynamics: The architecture cannot inherently distinguish between dynamic and static sources; this limitation requires additional mechanisms, such as hierarchical time-scale modeling or penalties that discourage constant outputs.

Implications for world-model research:

Contrastive-predictive approaches such as SimCLR-JEPA show robustness to changing distractors and promise for pretraining in visually complex environments. However, prioritization of slow features results in catastrophic failure in the presence of persistent distractors, representing a fundamental Achilles’ heel. Mitigating this failure is likely to necessitate architectural inductive biases (e.g., hierarchical JEPA), enhanced objectives penalizing over-focus on static features, or alternative input regimes (e.g., frame-differences, optical flow) that circumvent the encoding of fixed backgrounds (Sobal et al., 2022).
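The frame-differencing idea mentioned above is simple to illustrate: subtracting consecutive frames cancels any fixed background exactly, leaving only the moving content. The arrays below are synthetic stand-ins for observations, with a single-pixel “dot” for clarity.

```python
# Sketch of frame differencing as a fix for fixed-background collapse.
import numpy as np

rng = np.random.default_rng(0)
background = rng.uniform(0, 1, size=(64, 64))   # fixed episode-wise distractor

def frame(dot_xy, bg):
    img = bg.copy()
    img[dot_xy] += 1.0                          # a crude single-pixel "dot"
    return img

f1 = frame((10, 10), background)                # dot at (10, 10)
f2 = frame((12, 11), background)                # dot has moved to (12, 11)
diff = f2 - f1                                  # the fixed background cancels
```

Only the two dot pixels survive the subtraction, so a downstream encoder sees no static content to collapse onto; the trade-off is that genuinely static task-relevant features are also discarded.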
