
Variational JEPA Modeling

Updated 17 March 2026
  • Variational JEPA is a probabilistic framework that generalizes JEPA by replacing deterministic regression with a variational latent-space predictive distribution.
  • It unifies self-supervised representation learning with predictive state representations and Bayesian filtering to handle high-dimensional, noisy data.
  • Its Bayesian extension, BJEPA, leverages a Product of Experts formulation to enable zero-shot task transfer and constraint satisfaction.

Variational Joint Embedding Predictive Architectures (VJEPA) comprise a probabilistic generalization of Joint Embedding Predictive Architectures, providing a scalable, uncertainty-aware, and likelihood-free framework for world modeling in high-dimensional and noisy environments. VJEPA replaces the deterministic regression objectives of standard JEPA with a variational latent-space predictive distribution, systematically unifying representation learning with Predictive State Representations (PSRs) and Bayesian filtering. Its Bayesian extension, BJEPA, further introduces modularity for constraint satisfaction and zero-shot task transfer through a Product of Experts formulation. This article outlines the core principles, architecture, theoretical guarantees, and empirical properties of VJEPA and BJEPA, emphasizing their roles in robust, uncertainty-aware planning.

1. Architecture and Generative Foundations

VJEPA models sequential data via two key architectural elements: a latent context encoder and a variational predictive head. For a timestep $t$:

  • The context encoder $f_\theta$ maps the observed trajectory $x_{\le t}$ to a latent state $s_t = f_\theta(x_{\le t})$.
  • The target latent $s_{t+k} = f_{\theta'}(x_{t+k})$ is produced by an exponential moving average (EMA) encoder with parameters $\theta'$, which stabilizes training and provides a slowly varying target distribution.
  • Side information $\xi_{t+k}$ (actions, time indices, etc.) is optionally supplied for conditional prediction.
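
The EMA relationship between the context encoder and the target encoder can be illustrated with a minimal sketch. The function name `ema_update`, the flat parameter lists, and the decay value are illustrative assumptions, not details taken from the source.

```python
def ema_update(target_params, online_params, decay=0.99):
    """Move target parameters toward the online parameters.

    Implements theta' <- decay * theta' + (1 - decay) * theta elementwise.
    The decay value is an assumed hyperparameter, not one from the source.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * tp + (1.0 - decay) * op
            for tp, op in zip(target_params, online_params)]


# Hypothetical flat parameter vectors for the online and target encoders.
theta = [1.0, -2.0, 0.5]
theta_target = [0.0, 0.0, 0.0]

# Three updates with the online parameters held fixed: the target drifts
# toward theta but lags behind it, which is what makes it a slow teacher.
for _ in range(3):
    theta_target = ema_update(theta_target, theta, decay=0.9)
```

With the online parameters fixed, the target after n updates equals (1 - decay^n) times the online values, so a decay close to 1 gives a target that changes slowly.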

The generative process optionally posits

$$p_\phi(s_{t+k} \mid s_t, \xi_{t+k}), \qquad p_\psi(x_{t+k} \mid s_{t+k}),$$

resulting in joint dynamics through

$$p_{\phi,\psi}(s_{t+k}, x_{t+k} \mid x_{\le t}, \xi_{t+k}) = p_\phi(s_{t+k} \mid s_t, \xi_{t+k})\, p_\psi(x_{t+k} \mid s_{t+k}).$$

Practical JEPA systems typically omit $p_\psi$ and forego pixel-level likelihood modeling, restricting learning to latent-space dynamics.

The amortized inference model uses the EMA encoder to define

$$q_{\theta'}(s_{t+k}\mid x_{t+k}) = \mathcal{N}(s_{t+k} \mid f_{\theta'}(x_{t+k}), \Sigma_{\theta'}(x_{t+k})),$$

establishing a two-head latent prediction design: a fast evolving student and a slow target teacher (Huang, 20 Jan 2026).

2. Variational Objective and Regularization

The primary learning objective in VJEPA is a latent-space ELBO, written here as a loss to be minimized:

$$\mathcal{L}_\mathrm{VJEPA} = \mathbb{E}_{x_{t:t+k}} \left[ \mathbb{E}_{s_{t+k}\sim q_{\theta'}(\cdot \mid x_{t+k})} \left[-\log p_\phi(s_{t+k} \mid s_t, \xi_{t+k})\right] + \beta\, D_\mathrm{KL}\left(q_{\theta'}(s_{t+k} \mid x_{t+k})\,\|\,p(s)\right) \right],$$

with $p(s)=\mathcal{N}(0,I)$ as the latent prior and $\beta>0$. The predictive log-likelihood term encourages the model to concentrate probability mass on realistic targets, while the KL regularizer discourages the encoder from collapsing onto degenerate representations by penalizing arbitrarily small distributional variance. For Gaussian latent transitions,

$$p_\phi(s_{t+k} \mid s_t, \xi_{t+k}) = \mathcal{N}(s_{t+k} \mid \mu_\phi(s_t, \xi_{t+k}), \Sigma_\phi(s_t)),$$

the negative log-likelihood reduces, up to an additive constant, to a covariance-weighted squared error plus a log-determinant covariance penalty. Collapse avoidance is further substantiated by a theorem showing that globally collapsed encoders incur strictly higher loss for any nontrivial target diversity and nontrivial conditional structure (Huang, 20 Jan 2026).
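
The two terms of the objective can be written out for a single training example, as a rough sketch. It assumes diagonal covariances for both the target posterior and the predictor, which the source does not specify. The names `gaussian_nll`, `kl_to_standard_normal`, and `vjepa_loss` are hypothetical, and the Monte Carlo sample count is arbitrary.

```python
import math
import random


def gaussian_nll(s, mean, var):
    """Negative log-density of a diagonal Gaussian N(mean, diag(var)) at s."""
    return 0.5 * sum(
        (si - mi) ** 2 / vi + math.log(vi) + math.log(2.0 * math.pi)
        for si, mi, vi in zip(s, mean, var)
    )


def kl_to_standard_normal(mean, var):
    """Closed-form KL( N(mean, diag(var)) || N(0, I) )."""
    return 0.5 * sum(vi + mi ** 2 - 1.0 - math.log(vi)
                     for mi, vi in zip(mean, var))


def vjepa_loss(q_mean, q_var, p_mean, p_var, beta=1.0, n_samples=16, rng=None):
    """Monte Carlo estimate of the per-example loss (diagonal-Gaussian assumption).

    q_mean, q_var: target-encoder posterior over s_{t+k} given x_{t+k}.
    p_mean, p_var: predictor distribution over s_{t+k} given s_t and side info.
    """
    rng = rng or random.Random(0)
    nll = 0.0
    for _ in range(n_samples):
        # Reparameterized sample: s = mean + sqrt(var) * eps, eps ~ N(0, 1).
        s = [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
             for m, v in zip(q_mean, q_var)]
        nll += gaussian_nll(s, p_mean, p_var)
    return nll / n_samples + beta * kl_to_standard_normal(q_mean, q_var)


# The KL term grows without bound as the posterior variance shrinks to zero.
kl_wide = kl_to_standard_normal([0.0], [1.0])
kl_narrow = kl_to_standard_normal([0.0], [1e-6])
```

In this sketch the squared-error and log-variance terms of `gaussian_nll` correspond to the weighted error and log-determinant penalty described above, and `kl_narrow` exceeding `kl_wide` illustrates how the KL term penalizes vanishing variance.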

3. Connection to Predictive State Representations and Bayesian Filtering

VJEPA provides a finite-dimensional variational approximation to PSRs, in which the latent state $s_t$ compactly encodes the conditional distribution over future observations given history:

$$b_t(\cdot) = p(o_{t+1:t+H} \mid o_{\leq t}) \approx \mathcal{N}(\mu_\phi(s_t), \Sigma_\phi(s_t)).$$

Belief propagation is achieved via

$$p(s_{t+\Delta}) = \int p_\phi(s_{t+\Delta} \mid s_t, \xi_{t+\Delta})\, p(s_t)\, ds_t$$

and can be performed through particle sampling and rollouts. This "amortized" predictive modeling framework unifies latent self-supervised representation learning with classical Bayesian filters, obviating the need for explicit observation likelihoods in high-entropy output spaces (Huang, 20 Jan 2026).
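
The belief-propagation integral can be approximated with particles, as in the following minimal sketch. The transition mean `toy_transition_mean` is a hypothetical stand-in for a learned $\mu_\phi$, the transition noise is assumed isotropic with a fixed standard deviation, and side information is omitted for brevity.

```python
import random


def propagate_belief(particles, transition_mean, transition_std, rng):
    """Sample s_{t+Delta} for each particle s_t under a Gaussian transition.

    Drawing one successor per particle is a Monte Carlo approximation of the
    integral of p(s_{t+Delta} | s_t) against the current belief p(s_t).
    """
    return [
        [m + transition_std * rng.gauss(0.0, 1.0) for m in transition_mean(s)]
        for s in particles
    ]


def toy_transition_mean(s):
    """Hypothetical dynamics: contract the latent state toward the origin."""
    return [0.9 * si for si in s]


rng = random.Random(0)

# Initial belief: 2000 particles around the latent state (1, -1).
particles = [[rng.gauss(1.0, 0.1), rng.gauss(-1.0, 0.1)] for _ in range(2000)]

# Roll the belief forward five steps.
for _ in range(5):
    particles = propagate_belief(particles, toy_transition_mean, 0.05, rng)

# Summarize the propagated belief by its per-dimension mean.
belief_mean = [sum(p[d] for p in particles) / len(particles) for d in range(2)]
```

Under these toy dynamics the belief mean should sit near 0.9^5 times the initial mean. The spread of the particle cloud gives a sample-based estimate of predictive uncertainty.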

4. Predictive Sufficiency and Implications for Control

A latent encoding $s_t$ is predictively sufficient over horizon $H$ if, for all potential action sequences $u_{t:t+H-1}$,

$$p(s_{t+1:t+H} \mid h_t, u_{t:t+H-1}) = p(s_{t+1:t+H} \mid s_t, u_{t:t+H-1}),$$

where $h_t$ is the observation-action history. Theoretical results guarantee that, given sufficient expressivity and optimal training, $s_t$ attains a predictive state property: the optimal control policy $\pi^*(u_t \mid h_t)$ can be written as $\pi^*(u_t \mid s_t)$. Notably, these conditions render pixel-level generative modeling unnecessary for optimal control or planning; latent beliefs and transitions suffice (Huang, 20 Jan 2026).

5. Bayesian Extension: BJEPA and Product of Experts

BJEPA generalizes VJEPA by partitioning the predictive belief into a learned dynamics expert, $p_\mathrm{like}(s_{t+k} \mid s_t)$, and a modular (task, constraint, or goal) prior expert, $p_\mathrm{prior}(s_{t+k} \mid \eta)$, where $\eta$ encodes external desiderata. Assuming conditional independence,

$$p(s_t, \eta \mid s_{t+k}) = p(s_t\mid s_{t+k})\, p(\eta\mid s_{t+k}),$$

the posterior over $s_{t+k}$ factorizes as a Product of Experts (PoE):

$$p(s_{t+k} \mid s_t, \eta) \propto p_\mathrm{like}(s_{t+k} \mid s_t)\, p_\mathrm{prior}(s_{t+k} \mid \eta).$$

For Gaussian experts, the fused posterior remains Gaussian, with a precision-summed covariance and a precision-weighted mean:

$$\Sigma_\mathrm{post}^{-1} = \Sigma_\mathrm{like}^{-1} + \Sigma_\mathrm{prior}^{-1},\quad \mu_\mathrm{post} = \Sigma_\mathrm{post}\, \left(\Sigma_\mathrm{like}^{-1} \mu_\mathrm{like} + \Sigma_\mathrm{prior}^{-1} \mu_\mathrm{prior}\right).$$

This modularity enables zero-shot task transfer and constraint satisfaction, as priors encoding new tasks or physics constraints can be swapped in without retraining the core dynamics (Huang, 20 Jan 2026).
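
The Gaussian fusion rule can be checked numerically with a small sketch. It assumes diagonal covariances, so each latent dimension is fused independently. The function name `fuse_gaussian_experts` and the example numbers are illustrative, not taken from the source.

```python
def fuse_gaussian_experts(mu_like, var_like, mu_prior, var_prior):
    """Fuse two diagonal Gaussian experts into their normalized product.

    Per dimension: posterior precision is the sum of expert precisions, and
    the posterior mean is the precision-weighted average of expert means.
    """
    post_mean, post_var = [], []
    for ml, vl, mp, vp in zip(mu_like, var_like, mu_prior, var_prior):
        precision = 1.0 / vl + 1.0 / vp
        variance = 1.0 / precision
        post_var.append(variance)
        post_mean.append(variance * (ml / vl + mp / vp))
    return post_mean, post_var


# Hypothetical dynamics expert and goal prior over a 2-D latent state.
mu_like, var_like = [0.0, 2.0], [1.0, 1.0]
mu_prior, var_prior = [4.0, 2.0], [1.0, 0.25]

mu_post, var_post = fuse_gaussian_experts(mu_like, var_like, mu_prior, var_prior)
```

In the first dimension the experts are equally confident, so the fused mean lies halfway between them. In the second they agree on the mean and the fused variance is smaller than either expert's. Swapping in a different `mu_prior` and `var_prior` changes the fused belief without touching the dynamics expert, which is the modularity the PoE formulation provides.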

6. Empirical Evaluation in Noisy Environments

Empirical studies use a $20$-dimensional linear-Gaussian system, with a $4$-dimensional informative signal $s_t$ and a high-variance ($\times 8$) distractor $d_t$ (the “Noisy TV” setup). The observation is

$$x_t = C s_t + D(\sigma d_t) + \epsilon_t, \qquad \sigma \in \{0,1,2,\dots,8\}.$$

VAE and autoregressive (AR) pixel-prediction baselines exhibit collapse ($R^2 \approx 0.5$) at high noise, focusing on the distractor. In contrast, JEPA, VJEPA, and BJEPA yield $R^2 > 0.85$, robustly recovering the true signal. The probabilistic VJEPA architecture enables sampling-based estimation of credible intervals for future latent states, underscoring its capacity for explicit uncertainty quantification (Huang, 20 Jan 2026).
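
A data generator in the spirit of the observation equation above might look like the following sketch. Only the observation and signal dimensions come from the text. The distractor dimension, the AR(1) signal dynamics, the random mixing matrices, and the noise scales are all assumptions made for illustration, and `simulate_noisy_tv` is a hypothetical name.

```python
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def simulate_noisy_tv(n_steps, sigma, obs_dim=20, signal_dim=4,
                      distractor_dim=4, seed=0):
    """Generate x_t = C s_t + D (sigma * d_t) + eps_t for n_steps timesteps.

    Assumed details (not from the source): AR(1) signal dynamics, i.i.d.
    standard-normal distractor, random Gaussian C and D, small sensor noise.
    """
    rng = random.Random(seed)
    C = [[rng.gauss(0.0, 1.0) for _ in range(signal_dim)] for _ in range(obs_dim)]
    D = [[rng.gauss(0.0, 1.0) for _ in range(distractor_dim)] for _ in range(obs_dim)]
    s = [rng.gauss(0.0, 1.0) for _ in range(signal_dim)]
    signals, observations = [], []
    for _ in range(n_steps):
        # Slowly varying, predictable signal.
        s = [0.95 * si + 0.1 * rng.gauss(0.0, 1.0) for si in s]
        # Unpredictable distractor, rescaled by sigma.
        d = [sigma * rng.gauss(0.0, 1.0) for _ in range(distractor_dim)]
        eps = [0.01 * rng.gauss(0.0, 1.0) for _ in range(obs_dim)]
        x = [cs + dd + e for cs, dd, e in zip(matvec(C, s), matvec(D, d), eps)]
        signals.append(s)
        observations.append(x)
    return signals, observations


clean_signals, clean_obs = simulate_noisy_tv(n_steps=50, sigma=0)
noisy_signals, noisy_obs = simulate_noisy_tv(n_steps=50, sigma=8)

# Total observation energy, to compare the two noise levels.
clean_energy = sum(v * v for x in clean_obs for v in x)
noisy_energy = sum(v * v for x in noisy_obs for v in x)
```

Because the seed is shared, both runs carry the same underlying signal. At `sigma=8` most of the observation energy comes from the distractor, which is the regime where observation-space reconstruction objectives are reported to latch onto the nuisance.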

7. Likelihood-Free Modeling and Uncertainty-Aware Planning

VJEPA's learning paradigm is fundamentally likelihood-free with respect to pixel observations. The predictive latent-space approach avoids modeling high-entropy nuisances and instead focuses on compact, informative beliefs for sequential prediction and control. Key properties include:

  • Explicit uncertainty via the predictive distribution $p_\phi(s_{t+1} \mid s_t)$.
  • Distributional rollouts by sampling ensemble latent trajectories from the predictive model.
  • Risk-sensitive Model Predictive Control (MPC) by evaluating expected cost or Conditional Value-at-Risk (CVaR) in latent space.
  • Theoretical resilience to representation collapse due to the variational ELBO and KL penalty structure.

These features establish VJEPA and BJEPA as foundational methods for robust self-supervised world modeling and planning, supporting uncertainty quantification and scalable downstream robotics or decision-making (Huang, 20 Jan 2026).
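
The risk-sensitive evaluation listed among the key properties can be sketched as a comparison of candidate plans by expected cost and by CVaR. The cost samples below are made-up numbers standing in for costs computed from sampled latent rollouts. The names `expected_cost` and `cvar` and the sample-based tail-mean estimator are illustrative choices, not the source's procedure.

```python
import math


def expected_cost(costs):
    """Average cost over sampled rollouts."""
    return sum(costs) / len(costs)


def cvar(costs, alpha=0.9):
    """Sample estimate of CVaR: mean of the worst (1 - alpha) fraction of costs."""
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    ordered = sorted(costs, reverse=True)
    # Round before ceil to guard against floating-point error in (1 - alpha) * n.
    n_tail = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:n_tail]
    return sum(tail) / len(tail)


# Hypothetical per-rollout costs for two candidate action sequences.
plan_costs = {
    "a": [1.0] * 9 + [20.0],   # usually cheap, occasionally very costly
    "b": [3.0] * 10,           # slightly costlier but with no bad tail
}

best_by_mean = min(plan_costs, key=lambda k: expected_cost(plan_costs[k]))
best_by_cvar = min(plan_costs, key=lambda k: cvar(plan_costs[k], alpha=0.9))
```

Here the expected-cost criterion prefers plan `a`, while the CVaR criterion prefers plan `b` because it weighs only the worst outcomes. A risk-sensitive MPC loop would apply the same comparison to costs evaluated on latent trajectories sampled from the predictive model.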
