Sub-JEPA: Subspace Gaussian Regularization for Stable End-to-End World Models
Abstract: Joint-Embedding Predictive Architectures (JEPAs) provide a simple framework for learning world models by predicting future latent representations. However, JEPA training is subject to a bias-variance tradeoff. Without sufficient structural constraints, excessive representational variance causes the model to collapse to trivial solutions. The recent LeWorldModel (LeWM) shows that this issue can be alleviated by simply constraining latent embeddings with an isotropic Gaussian prior. However, latent representations inherently lie on low-dimensional manifolds within a high-dimensional ambient space, and enforcing an isotropic Gaussian prior directly in this ambient space introduces an overly strong bias. In this work, we propose Sub-JEPA, which seeks a favorable operating point on the bias-variance frontier by applying Gaussian constraints in multiple random subspaces rather than in the original embedding space. This design relaxes the global constraint while preserving its anti-collapse effect, leading to a better balance between training stability and representation flexibility. Extensive experiments across four continuous-control environments demonstrate that Sub-JEPA consistently outperforms LeWM with very clear margins. Our method is simple yet effective, and serves as a strong baseline for future JEPA-based world model research. The code is available at https://github.com/intcomp/Sub-JEPA.
Explain it Like I'm 14
What is this paper about?
This paper introduces a new way to train "world models," which are AI systems that learn how the world changes over time so they can plan actions. The new method is called Sub-JEPA. It helps the model learn stable, useful "hidden codes" (called latent representations) without getting stuck in bad, boring solutions. The key trick is to gently shape these hidden codes inside many small slices (subspaces) instead of squeezing the whole code all at once.
What questions did the researchers ask?
The researchers focused on three simple questions:
- How can we stop world models from "collapsing," where the hidden codes all become almost the same and useless?
- How can we keep training stable without making the model too stiff or too simple?
- Can we get better planning performance by applying a softer, smarter constraint that matches the true complexity of the task?
How does the method work? (Simple explanation with analogies)
First, a quick picture of a world model:
- An encoder turns each image of the world into a compact hidden code (like a secret summary).
- A predictor uses the current code plus the action (what you plan to do) to guess the next code (what happens next).
The problem:
- If we just train the predictor to match the next code, the encoder can cheat by mapping everything to nearly the same code. That makes the prediction easy but destroys useful information. This is called collapse.
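To see why collapse is tempting, here is a tiny sketch. The encoder and predictor below are made-up stand-ins (not the paper's networks): the encoder ignores its input, and the predictor just copies the current code. The prediction loss comes out perfect even though the codes say nothing about the world.

```python
# Toy illustration with hypothetical functions, not the paper's networks.

def collapsed_encoder(observation):
    """Ignores its input and always returns the same 2-D code."""
    return [0.0, 0.0]

def identity_predictor(code, action):
    """Predicts that the next code equals the current one."""
    return list(code)

def prediction_loss(observations, actions, encoder, predictor):
    """Mean squared error between predicted and actual next codes."""
    total, count = 0.0, 0
    for t in range(len(observations) - 1):
        predicted = predictor(encoder(observations[t]), actions[t])
        target = encoder(observations[t + 1])
        total += sum((p - q) ** 2 for p, q in zip(predicted, target))
        count += 1
    return total / count

observations = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.7], [0.3, 0.5]]
actions = [1, 0, 1]
collapsed_loss = prediction_loss(observations, actions,
                                 collapsed_encoder, identity_predictor)
print(collapsed_loss)  # 0.0: a perfect score, yet the code carries no information
```

This is why the prediction loss alone is not enough and some extra rule is needed to keep the codes spread out.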
A common fix (the earlier LeWorldModel idea):
- Force the hidden codes to "look like" a standard bell-shaped (Gaussian) distribution in the full high-dimensional space. This prevents collapse but can be too strict, like squashing a complex shape evenly from all directions.
Sub-JEPA's idea:
- Instead of squeezing the whole hidden code at once, slice it into many small, non-overlapping views (subspaces) using fixed, orthogonal "windows" (think of shining multiple perfectly angled flashlights on a sculpture).
- In each small subspace, gently encourage the numbers to look bell-shaped. This keeps the anti-collapse protection but relaxes the pressure overall, letting the model keep the shapes that matter for the task.
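The slicing step can be sketched in a few lines of plain Python. This is only an illustration under assumptions: it builds one random orthonormal basis with Gram-Schmidt (the paper mentions a QR decomposition, which yields the same kind of basis) and splits its rows into K fixed blocks. The function names and the equal-size split are choices made for this sketch, not taken from the released code.

```python
import random

def random_orthonormal_rows(dim, rng):
    """Gram-Schmidt on random Gaussian vectors: returns `dim` orthonormal rows."""
    rows = []
    while len(rows) < dim:
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for _ in range(2):  # two passes improve numerical orthogonality
            for u in rows:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        if norm > 1e-8:  # skip the (very unlikely) degenerate draw
            rows.append([a / norm for a in v])
    return rows

def make_subspace_projections(dim, num_subspaces, seed=0):
    """Split one orthonormal basis into K frozen blocks, each with d_s rows."""
    if dim % num_subspaces != 0:
        raise ValueError("dim must be divisible by num_subspaces")
    basis = random_orthonormal_rows(dim, random.Random(seed))
    d_s = dim // num_subspaces
    return [basis[k * d_s:(k + 1) * d_s] for k in range(num_subspaces)]

def project(code, projection):
    """Apply one projection: returns the d_s-dimensional view of `code`."""
    return [sum(p * c for p, c in zip(row, code)) for row in projection]

projections = make_subspace_projections(dim=8, num_subspaces=4)
views = [project([1.0] * 8, P) for P in projections]
print(len(views), len(views[0]))  # 4 2
```

Because the blocks come from a single orthonormal basis, no two views overlap, and together they keep the full length of the original code. The projections are created once and never updated, which mirrors the "frozen" design described here.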
A few everyday analogies help:
- Full-space squeezing: like pressing a balloon equally in every direction; it might lose interesting patterns.
- Subspace squeezing: like checking a big sculpture from many angles and making sure each view looks tidy, without flattening the sculpture.
Important details made simple:
- The "subspaces" come from random, orthogonal projections (non-overlapping, balanced views). They are frozen (not trained) so the rules don't move during learning.
- The training loss = predict-the-next-code loss + a gentle "be bell-shaped in each subspace" loss.
- The number of subspaces (K) controls how small each slice is. More slices = more flexibility, but if slices get too tiny, the checks become unreliable.
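The "be bell-shaped" check and the combined loss can be sketched as follows. This is a simplified illustration, not the paper's implementation: it uses one textbook form of the Epps-Pulley statistic (a weighted gap between the sample's characteristic function and that of a standard Gaussian), it tests each coordinate of a subspace instead of sampling random directions inside it, and the weight `reg_weight` is a placeholder value.

```python
import math
import random

def epps_pulley_statistic(samples, t_max=5.0, num_points=101):
    """Weighted squared gap between the empirical characteristic function
    of `samples` and that of N(0, 1), integrated by the trapezoidal rule."""
    n = len(samples)
    step = 2.0 * t_max / (num_points - 1)
    total = 0.0
    for i in range(num_points):
        t = -t_max + i * step
        real = sum(math.cos(t * x) for x in samples) / n
        imag = sum(math.sin(t * x) for x in samples) / n
        target = math.exp(-0.5 * t * t)  # characteristic function of N(0, 1)
        weight = math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)
        edge = 0.5 if i in (0, num_points - 1) else 1.0
        total += edge * weight * ((real - target) ** 2 + imag ** 2) * step
    return total

def subspace_gaussian_loss(subspace_views):
    """Average the statistic over every coordinate of every subspace view.
    subspace_views[k][d] holds the batch values of coordinate d in subspace k."""
    stats = [epps_pulley_statistic(coord)
             for view in subspace_views for coord in view]
    return sum(stats) / len(stats)

def total_loss(prediction_loss, subspace_views, reg_weight=0.1):
    """Training loss = prediction loss + weighted subspace regularizer."""
    return prediction_loss + reg_weight * subspace_gaussian_loss(subspace_views)

rng = random.Random(0)
healthy = [[[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(2)]
           for _ in range(3)]
collapsed = [[[0.0] * 300 for _ in range(2)] for _ in range(3)]
print(subspace_gaussian_loss(healthy) < subspace_gaussian_loss(collapsed))  # True
```

Collapsed codes (all zeros) score much worse than bell-shaped ones, so adding this term to the prediction loss pushes the encoder away from the lazy solution.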
What did they test, and how?
They trained and evaluated on four control tasks using only raw images:
- Two-Room: a simple 2D navigation task.
- Reacher: moving a two-link arm to a target.
- PushT: pushing a block on a table.
- OGB-Cube: a visually complex 3D manipulation task.
They compared Sub-JEPA to:
- LeWorldModel (LeWM): the earlier "full-space" Gaussian method.
- PLDM: a model that uses several hand-tuned training tricks.
- DINO-WM: a model that uses a powerful frozen vision encoder (pretrained elsewhere).
They also ran careful tests to understand why Sub-JEPA works:
- Effective rank: how many directions in the hidden code are meaningfully used (lower can mean a cleaner, more task-matched code).
- Ablations: changing the number and size of subspaces, and whether projections are orthogonal/frozen or trainable.
- Probes: checking if physical facts (like positions and angles) can be decoded from the hidden code.
- Geometry checks: are the learned โpathsโ over time smooth and straight, which helps planning?
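Effective rank is easy to compute once the eigenvalues of the embedding covariance are known. The sketch below uses the common entropy-based definition (the exponential of the Shannon entropy of the normalized eigenvalues); treating this as the exact definition used in the paper is an assumption.

```python
import math

def effective_rank(eigenvalues):
    """exp(Shannon entropy) of the normalized, non-negative spectrum."""
    positive = [v for v in eigenvalues if v > 0.0]
    total = sum(positive)
    if total == 0.0:
        return 0.0
    entropy = -sum((v / total) * math.log(v / total) for v in positive)
    return math.exp(entropy)

print(round(effective_rank([1.0, 1.0, 1.0, 1.0]), 6))  # 4.0
print(round(effective_rank([5.0, 0.0, 0.0, 0.0]), 6))  # 1.0
```

A code that spreads its variance evenly over four directions has effective rank 4, while a code that uses a single direction has effective rank 1.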
Main findings and why they matter
Here are the key results and their importance:
- Better planning across tasks: Sub-JEPA beat LeWM in all four environments, often by clear margins. The biggest jump came in Two-Room, where the task is truly low-dimensional, so the softer subspace constraint helped the most.
- Cleaner hidden codes: Sub-JEPA reduced the "effective rank" of the embeddings in a way that matched each task's needs. Bigger rank reductions lined up with bigger planning gains. This means the model focused on the important factors and ignored noisy extras.
- Right level of constraint: Using more subspaces usually helped, but making each subspace too tiny could hurt (especially in PushT). This shows a balance: relax enough to be flexible, but keep each slice big enough to be reliable.
- Orthogonal and frozen projections are best: Keeping the subspace "windows" fixed and non-overlapping gave the most stable results. Letting them train made the constraint weaker over time.
- Keeps physical meaning: On PushT, the hidden codes still contained valuable physical information (like positions and angles), often more easily recovered by a small neural probe. Linear decoding of angles was slightly worse in one case, but a small non-linear probe closed the gap.
- Smoother, more stable dynamics: The hidden paths over time were straighter and long-horizon predictions drifted less, which is crucial for planning over many steps.
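One simple way to put a number on "straighter paths" is the average cosine similarity between consecutive steps of a latent trajectory. This is a common straightness measure and is offered here only as an illustration; the paper's exact metric may differ.

```python
import math

def path_straightness(trajectory):
    """Mean cosine similarity between consecutive latent steps (1.0 = straight line)."""
    steps = [[b - a for a, b in zip(p, q)]
             for p, q in zip(trajectory, trajectory[1:])]
    cosines = []
    for u, v in zip(steps, steps[1:]):
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_u > 0.0 and norm_v > 0.0:  # ignore zero-length steps
            cosines.append(sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v))
    return sum(cosines) / len(cosines) if cosines else float("nan")

straight = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
zigzag = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]]
print(round(path_straightness(straight), 6))  # 1.0
print(round(path_straightness(zigzag), 6))    # 0.0
```

Values near 1 mean the hidden code moves in a steady direction over time, which makes multi-step predictions easier to trust.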
Why does this subspace idea help?
Many control problems really depend on a few key factors (like position and velocity), even if the images are high-dimensional. Forcing a full high-dimensional bell shape everywhere is too strong; it bends the code away from the task's true shape. By applying the "be bell-shaped" rule only within several smaller, orthogonal slices, the model keeps the anti-collapse benefit while better matching the task's natural simplicity.
What could this change going forward?
- Stronger, simpler baselines: Sub-JEPA is easy to implement, yet improves stability and performance. It's a solid starting point for future world-model research.
- Less need for heavy pretraining: It narrows the gap with methods that depend on large pretrained vision models, enabling more end-to-end learning directly from pixels.
- Better planning: Smoother latent dynamics and more compact representations should help real robots and agents plan further and more reliably.
- A general lesson: Matching the strength of your regularization to the task's true complexity (using subspaces) can boost both stability and flexibility, which is useful beyond world models, in other self-supervised learning setups too.
In short, Sub-JEPA shows that a small change (regularizing in many small, fixed slices instead of the whole space) can make end-to-end world models both steadier and smarter.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of what remains missing, uncertain, or unexplored in the paper, framed to be concrete and actionable for future research.
- Lack of theoretical guarantees for subspace Gaussian regularization: formalize when and why imposing Gaussianity in multiple low-dimensional orthogonal subspaces prevents collapse and improves prediction risk, and characterize the induced global latent distribution.
- Relationship between subspace parameters and learning dynamics: derive analytic links between K (number of subspaces), d_s (subspace dimension), the number of projection directions, batch/sequence lengths, and the bias-variance tradeoff in training stability and representation richness.
- Global distribution induced by subspace-wise constraints: quantify whether independent subspace Gaussian constraints imply any global isotropy/ellipticity in the full latent space, and under what conditions (if any) they approximate the LeJEPA optimal prior.
- Principled, data-driven selection of K and d_s: develop procedures to estimate intrinsic dimensionality online (e.g., via effective rank, spectral decay, or participation ratio) and adapt K/d_s during training to match the evolving latent manifold.
- Sensitivity to hyperparameters of the regularizer: systematically assess the impact of the regularization weight, the number of EP directions, and batch/sequence sizes on stability, performance, and collapse risk; provide tuning guidelines or auto-tuning schemes.
- Reliability of the Epps-Pulley normality loss in low dimensions: quantify statistical power and gradient stability of the EP statistic for small subspace dimension d_s; identify thresholds below which the signal becomes unreliable and propose mitigations (e.g., multi-dimensional tests).
- Alternative normality objectives: compare EP-based loss to sliced Wasserstein, energy distance, Shapiro-Wilk, or multivariate tests (e.g., Henze-Zirkler) in terms of stability, computational cost, and performance under subspace constraints.
- Computational overhead and scalability: measure and report the added training time and memory cost of per-subspace EP losses (scaling with K×M×N×B), and propose efficient implementations (e.g., structured projections, quasi-Monte Carlo directions, sharing statistics).
- Generalization across domains: evaluate Sub-JEPA on a broader range of tasks (e.g., discrete control, partial observability, multi-agent, real robotics), including real-world sensor noise and domain shifts, to assess robustness and applicability.
- Comparison to stronger end-to-end baselines: include state-of-the-art latent world models (e.g., DreamerV3, IRIS) under matched protocols to contextualize gains; explore hybridization with contrastive/non-contrastive anti-collapse methods (VICReg, Barlow Twins).
- Interplay with planner choice: quantify sensitivity of performance to different planning algorithms (MPC variants, CEM, gradient-based planning), horizons, and control frequencies; assess whether subspace regularization preferentially benefits certain planners.
- Scaling with embedding dimension D and architecture: ablate D, encoder/predictor capacity, and architectural choices to understand whether gains persist at larger scales or different network families.
- Online and reward-driven training: test Sub-JEPA in on-policy/off-policy RL loops with rewards, distribution shift, and exploration, examining stability under non-stationary data and its effect on sample efficiency.
- Adaptive or data-driven projections: investigate learned projections with constraints that prevent co-adaptation (e.g., spectral regularization, limited update frequency, mutual information control) or data-driven bases (PCA/CCA on latent statistics) while preserving orthogonality/isometry.
- Cross-subspace interactions: analyze whether independent subspace constraints inadvertently fragment task-relevant structure; measure cross-subspace correlations and propose coupling terms if beneficial.
- Treatment of periodic/rotational variables: design subspace-specific priors for angular quantities (e.g., von Mises or wrapped Gaussians) to address the observed decrease in linear decodability of rotations, and evaluate across tasks.
- Physical decodability beyond PushT: replicate probing across all environments, expand the set of physical variables (velocities, contact states, forces), and analyze per-subspace decodability to interpret what information each subspace captures.
- Quantitative long-horizon rollout evaluation across tasks: extend open-loop analyses beyond Two-Room with standardized metrics (e.g., cumulative prediction error, drift, consistency) to substantiate claims about improved temporal coherence.
- Full-space normality and geometry diagnostics: complement effective rank with full-space normality tests, covariance anisotropy measures, geodesic curvature/straightness, and manifold topology diagnostics to more precisely link geometry to planning gains.
- Robustness to dataset size and quality: characterize sample efficiency, performance under limited or noisy offline data, and sensitivity to replay buffer composition; propose curriculum or augmentation strategies aligned with subspace constraints.
- Regularization scheduling: explore annealing or curriculum strategies for the regularizer's hyperparameters (such as its weight, K, and d_s) during training to transition from anti-collapse to flexibility as representations mature.
- Structured randomness in projections: compare random orthonormal matrices to fast transforms (Hadamard, FFT-based) and block-diagonal structures for computational efficiency and controlled distortion.
- Uncertainty quantification and safety: investigate whether subspace-wise Gaussian constraints improve calibration, uncertainty estimates, and safe planning under model error; integrate with risk-sensitive MPC.
- Reproducibility and resource reporting: provide detailed training budgets, hardware, and runtime statistics for Sub-JEPA vs. baselines, and assess sensitivity to seeds beyond six to strengthen claims of stability.
- Hybrid priors and mixtures: examine anisotropic or mixture-of-Gaussians priors per subspace to better match heterogeneous latent factors, and study how such priors interact with JEPA's predictive objective.
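As a pointer for the data-driven selection item above, the participation ratio is one of the intrinsic-dimension estimates it names. A minimal sketch, assuming the eigenvalues of the embedding covariance have already been computed elsewhere; how such an estimate should map to a choice of K or d_s is left open here, as it is in the list.

```python
def participation_ratio(eigenvalues):
    """(sum of eigenvalues)^2 divided by the sum of squared eigenvalues."""
    total = sum(eigenvalues)
    squares = sum(v * v for v in eigenvalues)
    return (total * total) / squares if squares > 0.0 else 0.0

print(participation_ratio([1.0, 1.0, 1.0, 1.0]))  # 4.0
print(participation_ratio([3.0, 1.0]))            # 1.6
```

The value ranges from 1 (all variance in one direction) to the number of eigenvalues (variance spread evenly), so it can be tracked during training alongside effective rank.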
Practical Applications
Immediate Applications
Below are concrete uses that can be deployed now, leveraging the paper's method (Sub-JEPA) and training recipe, the released codebase, and standard model-predictive-control (MPC) planners.
- Stable latent world models for robotics planning from pixels
- Sector: Robotics (manufacturing, logistics, lab automation)
- What: Replace LeWM/JEPA regularization with Sub-JEPA to reduce collapse while preserving task-relevant low-dimensional structure, improving MPC-based planning success in navigation, reaching, and manipulation.
- Tools/workflows: Integrate the Sub-JEPA regularizer from the provided GitHub repo; keep orthogonal frozen projections; tune K and d_s via small validation sweeps; monitor effective rank and temporal path straightness; plug into existing MPC planners.
- Assumptions/dependencies: Sufficient offline trajectories (images+actions); vision-based control setup; MPC/planning stack available; compute to train encoders; tasks with relatively low intrinsic dynamics dimensionality benefit most.
- Retrofitting existing pixel-based control pipelines
- Sector: Consumer and service robotics (vacuums, mobile bases, simple pick-and-place arms)
- What: Swap the regularizer in current JEPA/LeWM-like pipelines to immediately gain more coherent latent trajectories and longer-horizon rollout stability, reducing drift.
- Tools/workflows: Minimal code change (drop-in regularizer); reuse training schedules and optimizers; keep projections frozen and orthonormal; sanity-check with open-loop rollout visualizations.
- Assumptions/dependencies: Existing JEPA-style backbone; ability to retrain with available logs; testing in the target domain to validate safety/robustness.
- Offline, reward-free pretraining for goal-conditioned planning
- Sector: Robotics, autonomy R&D
- What: Train Sub-JEPA on logged state-action trajectories without rewards to enable zero-shot or few-shot planning via latent rollouts, reducing reliance on pretraining (e.g., DINO) and complex multi-term losses.
- Tools/workflows: Curate logs; train Sub-JEPA with prediction+subspace Gaussian losses; deploy MPC or CEM over latent dynamics; evaluate with success-rate metrics.
- Assumptions/dependencies: Quality and diversity of logs; planner quality; task goals must be expressible in latent space.
- "Anti-collapse" module for JEPA-style self-supervised learning
- Sector: Computer vision/software
- What: Use Sub-JEPA's subspace Gaussian constraint as a drop-in anti-collapse regularizer for JEPA variants in images/video, avoiding large-batch contrastive setups.
- Tools/workflows: Add subspace projections and Epps-Pulley normality test per subspace; tune K to balance bias-variance; track embedding effective rank.
- Assumptions/dependencies: Latent representations suitable for projected normality tests; compute overhead for multiple subspace and direction tests.
- Training diagnostics and monitoring
- Sector: ML Ops/Research
- What: Adopt effective rank and temporal path straightness as lightweight health signals for latent geometry during world-model training.
- Tools/workflows: Periodically compute effective rank on held-out samples; log straightness curves; set alert thresholds; perform early stopping or K adjustments.
- Assumptions/dependencies: Access to representative held-out data; simple post-hoc analytics pipeline.
- Simulated environments and game AI planning
- Sector: Gaming, simulation, embodied AI benchmarks
- What: Use Sub-JEPA for stable world models that support latent-space planning from pixels in simulators (navigation, reaching, pushing).
- Tools/workflows: Integrate into existing RL/planning pipelines; leverage code repo; run ablations on K to match task complexity.
- Assumptions/dependencies: Visual observations dominate; actions are continuous or can be embedded; planners in place.
- Academic baseline for JEPA world-model research
- Sector: Academia
- What: Employ Sub-JEPA as a strong, simple baseline for future JEPA-based world model studies and ablation suites.
- Tools/workflows: Reproduce reported environments; extend to new datasets; compare regularization strategies; publish diagnostics (rank, straightness).
- Assumptions/dependencies: Familiarity with JEPA/LeWM frameworks; compute resources for training.
- Vision-based drone or mobile robot navigation prototypes
- Sector: UAVs, AMRs
- What: Improve latent rollout coherence for short- to mid-horizon planning from onboard cameras in controlled environments.
- Tools/workflows: Train Sub-JEPA on flight/drive logs; plug into MPC; validate in indoor labs before field tests.
- Assumptions/dependencies: Strict safety protocols; limited domain shift; reliance on good action logging and calibration.
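Several of the workflows above end with "deploy MPC or CEM over latent dynamics." The sketch below shows the shape of a cross-entropy-method (CEM) planner in latent space. Everything model-related is a stand-in: `toy_latent_dynamics` is a hypothetical additive dynamics function used in place of a trained Sub-JEPA predictor, the goal is given directly as a latent code, and the population sizes are arbitrary.

```python
import random

def toy_latent_dynamics(code, action):
    """Hypothetical stand-in for a trained predictor: the action shifts the code."""
    return [c + a for c, a in zip(code, action)]

def rollout_cost(start, actions, goal, dynamics):
    """Squared distance between the final predicted code and the goal code."""
    code = start
    for action in actions:
        code = dynamics(code, action)
    return sum((c - g) ** 2 for c, g in zip(code, goal))

def cem_plan(start, goal, dynamics, horizon=5, population=64, elites=8,
             iterations=10, seed=0):
    """Cross-entropy method over action sequences; returns the mean plan."""
    rng = random.Random(seed)
    dim = len(start)
    mean = [[0.0] * dim for _ in range(horizon)]
    std = [[1.0] * dim for _ in range(horizon)]
    for _ in range(iterations):
        candidates = [
            [[rng.gauss(mean[t][d], std[t][d]) for d in range(dim)]
             for t in range(horizon)]
            for _ in range(population)
        ]
        candidates.sort(key=lambda plan: rollout_cost(start, plan, goal, dynamics))
        best = candidates[:elites]
        # Refit the sampling distribution to the lowest-cost plans.
        for t in range(horizon):
            for d in range(dim):
                values = [plan[t][d] for plan in best]
                mean[t][d] = sum(values) / elites
                var = sum((v - mean[t][d]) ** 2 for v in values) / elites
                std[t][d] = max(var ** 0.5, 1e-3)
    return mean

start, goal = [0.0, 0.0], [2.0, -1.0]
plan = cem_plan(start, goal, toy_latent_dynamics)
final_cost = rollout_cost(start, plan, goal, toy_latent_dynamics)
print(final_cost < rollout_cost(start, [], goal, toy_latent_dynamics))  # True
```

In an MPC loop, only the first action of the returned plan would be executed before re-encoding the new observation and planning again. Action limits and a learned goal encoder are omitted for brevity.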
Long-Term Applications
The following opportunities require additional research, scaling, domain adaptation, and/or safety validation beyond the paper's scope.
- Generalist robot world models trained from diverse logs
- Sector: Robotics (multi-task assistants, warehouse generalists)
- What: Scale Sub-JEPA to multi-task, multi-environment datasets to learn broadly applicable latent dynamics models that plan across tasks without reconstruction.
- Potential products: Generalist latent world-model backbones; planning-as-a-service for robots.
- Dependencies: Massive and diverse datasets; robust goal-specification interfaces; safety evaluation; potential hybridization with pretrained vision backbones for visually complex scenes.
- Autonomous driving world models from camera logs
- Sector: Automotive
- What: Use subspace-regularized latent models for short-horizon planning/prediction from dashcam or surround cameras as one component of the driving stack.
- Potential products/workflows: On-vehicle latent dynamics module to support planners; continuous learning pipelines with effective-rank monitoring.
- Dependencies: Multimodal fusion (LiDAR/Radar); real-time constraints; rigorous validation; regulatory approval.
- Surgical and healthcare robotics
- Sector: Healthcare
- What: Learn stable latent dynamics of surgical scenes/instrument interactions for assistive planning and prediction.
- Potential tools: Sim-to-real pretraining; domain-adapted Sub-JEPA with medical image encoders.
- Dependencies: High safety bar; limited, sensitive data; explainability; integration with clinical workflows.
- Process control and digital twins
- Sector: Manufacturing, energy, chemical process control
- What: Apply subspace Gaussian regularization to latent models of plant dynamics learned from multivariate sensor streams, aligning with low intrinsic dimensionality.
- Potential products: Latent digital twins for forecasting and control; monitoring tools using effective-rank as a stability indicator.
- Dependencies: Extension from vision to multimodal/tabular time series; interpretability; integration with existing control systems; robust handling of nonstationarity.
- AR/VR interactive agents and user-intent prediction
- Sector: AR/VR, human-computer interaction
- What: Stable latent models predicting future scene and user interaction dynamics for anticipatory assistance.
- Potential workflows: On-device Sub-JEPA pretraining on interaction logs; latency-aware inference.
- Dependencies: Privacy-preserving data collection; efficient on-device models; multimodal inputs (gaze, hand pose).
- Financial and demand forecasting as latent dynamics
- Sector: Finance, retail, energy demand
- What: Explore Sub-JEPA-inspired subspace constraints for learning predictive latent dynamics on complex time series to avoid collapse and over-regularization.
- Potential tools: Subspace-regularized latent forecasters; monitoring dashboards with rank/straightness-like metrics.
- Dependencies: Adapting normality tests and projections to non-Gaussian, heavy-tailed distributions; regulatory compliance.
- Adaptive subspace design and auto-tuning
- Sector: ML systems, AutoML
- What: Develop methods that adapt K and d_s during training based on intrinsic dimensionality estimates to maintain the best bias-variance tradeoff.
- Potential products: Auto-tuning modules that track effective rank and adjust subspace configuration online.
- Dependencies: Reliable dimensionality estimators; stability of adaptive schedules; minimal overhead.
- Multimodal world models (vision + proprioception + language)
- Sector: Robotics, embodied AI
- What: Extend Sub-JEPA to jointly constrain subspaces across modalities, improving planning and instruction following.
- Potential products: Instruction-following world models with subspace regularizers per modality; language-conditioned planners.
- Dependencies: Cross-modal alignment strategies; large-scale datasets; evaluation protocols for compositional generalization.
- Safety and policy guidelines for world-model deployment
- Sector: Policy/standards
- What: Use diagnostics like effective rank and latent straightness as auditable indicators of representational health in deployed planning systems.
- Potential workflows: Compliance checklists requiring reporting of latent-geometry metrics; model cards for world models.
- Dependencies: Community consensus on metrics; correlation to safety outcomes; standardization bodies' adoption.
Notes on feasibility
- Sub-JEPA is validated on four continuous-control, vision-based benchmarks; generalization to other domains (multimodal, non-visual, discrete action spaces) will require adaptation.
- Performance remains data- and hyperparameter-sensitive; extremely small subspaces (very large K) can degrade reliability of the normality test signal.
- The method assumes frozen, orthonormal projections; making these learnable can undermine the anti-collapse effect.
- For highly complex visual environments, pretrained encoders may still outperform purely end-to-end approaches; hybrid strategies may be prudent.
Glossary
- ambient space: The high-dimensional space in which lower-dimensional structures or manifolds are embedded. "latent representations inherently lie on low-dimensional manifolds within a high-dimensional ambient space, and enforcing an isotropic Gaussian prior directly in this ambient space introduces an overly strong bias."
- anti-collapse effect: The property of a regularizer to prevent embeddings from collapsing to trivial, near-constant representations. "This design relaxes the global constraint while preserving its anti-collapse effect, leading to a better balance between training stability and representation flexibility."
- autoregressive generative models: Models that generate future elements by conditioning on previously generated ones. "Autoregressive generative models such as IRIS [3] and DreamerV3 [2] couple the world model with an image decoder and achieve strong results in reward-driven settings, but reconstruction-based objectives can produce embeddings that are uninformative for control."
- bias-variance tradeoff: The tension between underconstraining (high variance, risk of collapse) and overconstraining (high bias, limited flexibility) during training. "JEPA training is subject to a bias-variance tradeoff."
- Cramér-Wold theorem: A result stating that a multivariate distribution is determined by the distributions of its one-dimensional projections. "by the Cramér-Wold theorem [22] matching all projected marginals to a Gaussian implicitly enforces an isotropic Gaussian joint distribution."
- curse of dimensionality: The phenomenon where high-dimensional problems become intractable due to exponential growth in complexity. "sketching the embedding distribution with random directions to bypass the curse of dimensionality."
- DINO-WM: A world model that uses frozen pretrained DINOv2 visual features to stabilize training. "DINO-WM [11] uses a frozen pretrained DINOv2 [33] visual encoder to mitigate representation collapse."
- DINOv2: A self-supervised visual representation model used as a pretrained encoder. "DINO-WM [11] uses a frozen pretrained DINOv2 [33] visual encoder to mitigate representation collapse."
- effective rank: A measure of the effective dimensionality used by a representation, derived from the spectrum of the covariance matrix. "we analyze the effective rank [34] of the learned embeddings."
- empirical covariance matrix: The sample-based covariance matrix used to characterize variability of embeddings. "and let {λ_i} be the eigenvalues of its empirical covariance matrix."
- Epps-Pulley normality statistic: A test statistic based on the empirical characteristic function to assess normality. "We evaluate the Epps-Pulley [29] normality statistic on this sample set:"
- Gaussian regularization: A constraint pushing projected embeddings to match a Gaussian distribution to prevent collapse. "We then apply Gaussian regularization independently in each subspace."
- I-JEPA: An instantiation of JEPA for images that learns by predicting embeddings of masked or future views. "This recipe has been instantiated in image represen- tation learning with I-JEPA [6],"
- inductive bias: Prior structural preference introduced by model design or regularization that guides learning. "yielding a more flexible inductive bias that better matches the intrinsic structure of the underlying dynamics."
- isotropic Gaussian distribution: A multivariate normal distribution with identical variance in all directions and zero covariance. "regularizes latent embeddings toward an isotropic Gaussian distribution."
- isotropic Gaussian prior: A prior assumption that embeddings follow an isotropic Gaussian, used as a regularization target. "enforcing an isotropic Gaussian prior directly in this ambient space introduces an overly strong bias"
- JEPA (Joint-Embedding Predictive Architecture): A framework that learns by predicting future or masked embeddings without reconstructing pixels. "Joint-Embedding Predictive Architectures (JEPAs) provide a simple framework for learning world models by predicting future latent representations."
- Johnson-Lindenstrauss lemma: A result guaranteeing distance-preserving random projections into lower-dimensional spaces. "Random projections underpin scalable dimensionality reduction via the Johnson-Lindenstrauss lemma [23]"
- latent dynamics models: Models that predict future states in a learned latent space rather than pixel space. "Test-time planning via Model Predictive Control over such latent dynamics models has demonstrated strong performance across continuous-control and navigation tasks [15, 11, 9], but requires a well-structured, non-degenerate latent space."
- LeJEPA: A JEPA variant introducing Gaussian regularization with theoretical guarantees. "Gaussian regularization, introduced in LeJEPA [21], takes a principled approach:"
- LeWorldModel (LeWM): A JEPA-based world model that stabilizes training via a global isotropic Gaussian constraint. "The recent LeWorldModel (LeWM) shows that this issue can be alleviated by simply constraining latent embeddings with an isotropic Gaussian prior."
- low-dimensional manifolds: Smooth, lower-dimensional structures where high-dimensional data effectively lies. "the latent representations of natural control tasks typically lie on low-dimensional manifolds embedded within the high-dimensional ambient space"
- Model Predictive Control: A planning method that optimizes control actions by rolling out a dynamics model over a future horizon. "Test-time planning via Model Predictive Control over such latent dynamics models has demonstrated strong performance"
- Multi-Subspace Gaussian regularization: The proposed regularizer applying Gaussian constraints in multiple low-dimensional projected subspaces. "full-space Gaussian regularization is replaced by Multi-Subspace Gaussian regularization (Section 3.3)."
- open-loop rollouts: Forward predictions made by a model without corrective feedback from ground truth observations. "we compare open-loop rollouts of Sub-JEPA and LeWM [12] on Two-Room [11]."
- orthogonal projections: Projections using orthogonal (mutually perpendicular) directions, preserving geometry in subspaces. "In representation learning, orthogonal projections have been used to decorrelate and spread features [20]."
- orthogonality penalty: A loss term encouraging learned projection matrices to remain (approximately) orthogonal. "subject to an orthogonality penalty."
- proprioceptive inputs: Internal sensor measurements (e.g., joint angles/velocities) provided to a model alongside vision. "we report DINO-WM without proprioceptive inputs as the main reference"
- QR decomposition: A matrix factorization into an orthogonal matrix and an upper triangular matrix, used to obtain orthonormal bases. "followed by QR decomposition to obtain an orthonormal basis."
- random projections: Projections using randomly sampled directions to reduce dimensionality while approximately preserving structure. "Random projections underpin scalable dimensionality reduction via the Johnson-Lindenstrauss lemma [23]"
- representation collapse: The failure mode where an encoder maps diverse inputs to near-identical embeddings. "The central challenge is representation collapse [7]: without explicit structural constraints the encoder can map all inputs to nearly identical embeddings, trivially minimizing the prediction loss while destroying useful structure."
- row-orthonormal projection: A projection matrix whose rows form an orthonormal set, ensuring non-redundant, balanced subspace views. "form the projection matrix P_k ∈ R^{d_s × D}, yielding a row-orthonormal projection matrix."
- sliced Wasserstein distances: Metrics that approximate high-dimensional optimal transport via many 1D projections. "and enable tractable distribution matching through sliced Wasserstein distances [24], which reduce high-dimensional optimal transport to a sequence of one-dimensional comparisons."
- stop-gradient: A training technique preventing gradients from flowing through part of a network, often used to avoid collapse. "Non-contrastive approaches such as BYOL [18] rely on teacher-student asymmetry with stop-gradient,"
- Sub-JEPA: The proposed method that applies Gaussian regularization in multiple random subspaces for stability and flexibility. "In this work, we propose Sub-JEPA, which seeks a more favorable operating point on the bias-variance frontier by moving Gaussian regularization from the ambient space into low-dimensional subspaces."
- temporal path straightening: A measure of how linearly latent trajectories evolve over time, indicating smoother dynamics. "we examine latent trajectory geometry via temporal path straightening, which measures how linearly dynamics evolve in latent space"
- UMAP: A nonlinear dimensionality reduction technique for visualizing high-dimensional data. "the [CLS] embeddings are projected to 2D via UMAP [36], colored by normalized temporal index."
- VICReg: A self-supervised learning loss enforcing variance, invariance, and covariance constraints to prevent collapse. "as in PLDM [9], which applies VICReg [8] and requires tuning several sensitive hyperparameters."
- Whitening-MSE: A self-supervised objective that whitens features and enforces uniformity on the unit sphere. "Whitening-MSE [20] further enforces a uniform distribution on the unit sphere."
- world models: Predictive models that capture environment dynamics to support planning and control. "World models (WM) [1, 2], predictive representations of how environments evolve under actions, have become critical building blocks of modern artificial intelligence."