
POMA-JEPA: Joint Embedding for Molecular & 3D Imaging

Updated 21 November 2025
  • POMA-JEPA is a joint embedding predictive framework that leverages latent prediction and self-distillation to suppress noise and prioritize high-influence features.
  • It applies domain-specific strategies in polymer graph analysis, 3D scene understanding, and positronium imaging, achieving enhanced label efficiency and robust feature extraction.
  • The framework utilizes dynamic patching, lightweight predictors, and EMA-based target encoding, yielding performance gains such as a ~0.15 R² boost and an event resolution of ~80 ps.

POMA-JEPA denotes a class of joint embedding predictive architectures (JEPAs) with domain-specific instantiations in molecular graph learning, 3D scene understanding, and positronium-based medical imaging. While the acronym "POMA" adopts different concrete referents across the scientific literature—polymer molecular analysis, point map alignment, and positronium orthopositronium measurement analysis—the underlying paradigm in each case exploits predictive learning in latent representation space rather than pixel or feature reconstruction. The POMA-JEPA framework leverages the inductive bias of deep self-distillation to produce semantically meaningful, low-noise features essential for efficient learning in data-scarce or noisy environments (Piccoli et al., 22 Jun 2025, Mao et al., 20 Nov 2025, Littwin et al., 3 Jul 2024).

1. Theoretical Foundations of Joint Embedding Predictive Architectures

The JEPA framework centers on learning representations such that, given a “context” view of data, a predictor network trained on the context embedding can reconstruct the “target” embedding produced by a teacher or exponential moving average (EMA) encoder acting on a correlated or masked view.

Given encoders $f_W$ and $g_V$, the standard JEPA loss is

$$\mathcal{L}_{\rm JEPA} = \frac{1}{2}\,\mathbb{E}_{x,y}\left\|\,g_V(f_W(x)) - \mathrm{StopGrad}\{f_W(y)\}\right\|^2,$$

where $x, y$ are correlated (or masked) views. This stands in contrast with the masked autoencoder (MAE) objective

$$\mathcal{L}_{\rm MAE} = \frac{1}{2}\,\mathbb{E}_{x,y}\left\|g_V(f_W(x)) - y\right\|^2,$$

which reconstructs missing features in input space. The latent prediction in JEPA introduces an implicit bias toward high-influence (predictive, low-noise) features, as shown analytically in deep linear models (Littwin et al., 3 Jul 2024).
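As an illustrative numerical sketch (the function names are ours, not from the cited work), the two objectives differ only in the space where the error is measured—latent embeddings with a stop-gradient target for JEPA, raw inputs for MAE:

```python
import numpy as np

def jepa_loss(z_ctx, z_tgt):
    """JEPA objective: predict the target embedding in latent space.
    z_ctx: predictor output g_V(f_W(x)); z_tgt: target embedding f_W(y),
    treated as a constant (StopGrad: no gradient flows through it)."""
    z_tgt = np.asarray(z_tgt)  # in an autodiff framework this would be detached
    return 0.5 * np.mean(np.sum((np.asarray(z_ctx) - z_tgt) ** 2, axis=-1))

def mae_loss(x_hat, y):
    """MAE objective: reconstruct the raw target view y in input space."""
    return 0.5 * np.mean(np.sum((np.asarray(x_hat) - np.asarray(y)) ** 2, axis=-1))
```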

Key properties and implications:

  • The gradient-flow ODEs of JEPA amplify regression coefficients $\rho_i$ (signal-to-noise) with depth $L$ as $\bar w_{i,\infty} = \rho_i^L$.
  • Feature learning exhibits step-wise dynamics: high-influence features are learned rapidly and dominate the representation, while noise-dominated features are suppressed, especially in early training or with early stopping.
  • Lightweight predictors suffice; overparameterization is unnecessary and may increase risk of collapse.
  • In nonlinear architectures, empirical results and numerical simulations corroborate this high-influence prioritization (Littwin et al., 3 Jul 2024).
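The depth-amplification fixed point above can be sketched numerically (a toy illustration under the deep linear analysis; `effective_weights` is an illustrative name):

```python
import numpy as np

def effective_weights(rho, L):
    """Fixed point of the deep linear JEPA gradient flow: the effective
    weight on feature i converges to rho_i ** L, so depth L widens the
    gap between high-influence and noise-dominated features."""
    return np.asarray(rho, dtype=float) ** L

# High-SNR feature (rho = 0.99) vs. noise-dominated feature (rho = 0.3):
shallow = effective_weights([0.99, 0.3], L=1)   # relative weight ratio ~0.30
deep = effective_weights([0.99, 0.3], L=16)     # ratio ~5e-9: noise suppressed
```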

2. POMA-JEPA in Polymer Molecular Graph Representation

The “POMA-JEPA” protocol for polymer molecular graphs employs two GNN-based encoders (context and target), both variants of weighted directed message-passing neural networks (wD-MPNN) (Piccoli et al., 22 Jun 2025). Key technical steps and their rationale:

  • Graph Construction: The polymer is encoded as a stochastic molecular graph, with each node an atom and edges assigned probabilistic weights reflecting monomer connectivity.
  • Subgraph Partitioning: Numerous subgraphs (patches) $G_i$ form the basis for constructing context ($G_x$) and disjoint target subgraphs ($G_{y(i)}$). Partitioning is performed via random-walk sampling (dynamic), motif-based decomposition (BRICS fragments), or METIS clustering.
  • Encoder Operations:
    • Context encoder $E_{\mathrm{ctx}}$ processes only $G_x$; its embedding is pooled over nodes.
    • Target encoder $E_{\mathrm{tgt}}$ receives the whole graph but pools only the target-subgraph $G_{y(i)}$ nodes at output.
    • Each encoder has 3 layers with hidden dimension $d \approx 300$.
  • Positional Encoding: Random-walk structural encoding (RWSE) provides positional tokens, mapped linearly into the context embedding input.
  • Predictor: An MLP $h_\phi$ transforms the concatenated context and positional vectors to predict the latent target embedding.

The POMA-JEPA loss is a mean-squared latent-space prediction error between $\widehat{\mathbf{s}}_{y(i)}$ and $\mathbf{s}_{y(i)}$. An auxiliary "pseudolabel" molecular-weight regression is used, though its impact is modest.
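A minimal sketch of the subgraph pooling and latent-space objective described above (function names and mean-pooling choice are illustrative, not taken from the cited paper):

```python
import numpy as np

def pool_subgraph(node_embeddings, node_ids):
    """Pool a subgraph embedding from per-node embeddings. In POMA-JEPA the
    target encoder runs on the whole graph but pools only the target-subgraph
    nodes G_{y(i)}; the context encoder sees only G_x."""
    return np.mean(np.asarray(node_embeddings)[node_ids], axis=0)

def poma_latent_loss(s_hat, s_tgt):
    """Mean-squared latent prediction between the predictor output s_hat
    and the (stop-gradient) pooled target embedding s_tgt."""
    return np.mean((np.asarray(s_hat) - np.asarray(s_tgt)) ** 2)
```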

Training and Evaluation

Pretraining occurs on ~17k polymer structures, with dynamic context/target subgraphing at each step. Downstream tasks involve electron affinity regression and phase classification using the retained target encoder and a shallow head. Dramatic $R^2$ gains (~0.15 absolute in the lowest-data regimes) are observed for electron affinity with only 0.4% of labels, with the effect saturating by 8% labeled data. Cross-polymer generalization is observed in diblock phase-prediction transfer settings. Classic input-space SSL and RF-on-ECFP baselines are outpaced once a moderate number of labels becomes available (Piccoli et al., 22 Jun 2025).

3. POMA-JEPA in 3D Point Map-Based Scene Understanding

POMA-JEPA as implemented in the POMA-3D framework establishes a foundational self-supervised training signal for permutation-invariant, geometry-aware scene representation (Mao et al., 20 Nov 2025). Overview of the system:

  • Data and Patchification: Use of structured 2D point maps (explicit 3D coordinates on regular grids) corresponding to depth/RGB-D scenes. Patchifying point maps produces “tokens” for transformer-style ViT encoders.
  • Context/Target Dual Encoder Structure:
    • Context encoder $E_C$ (ViT-B/16 base, LoRA-adapted) ingests masked multi-view point maps, producing visible-patch embeddings.
    • Target encoder $E_T$ is an EMA copy, producing full patch embeddings.
    • The predictor $f_\theta$ (2-layer MLP) reconstructs target embeddings of masked patches from context embeddings.
  • Mathematical Objective: The predictive (POMA-JEPA) loss is a bidirectional Chamfer distance over latent patch embeddings for masked patches, ensuring permutation- and view-invariance: $$\mathcal{L}_{\rm pjepa} = \sum_{i\in\Omega_M}\min_{j\in\Omega_M}\|\widehat Z_T^i - Z_T^j\|_2^2 + \sum_{j\in\Omega_M}\min_{i\in\Omega_M}\|\widehat Z_T^i - Z_T^j\|_2^2,$$ where $\Omega_M$ is the set of masked patches across all $N_v$ views.
  • Augmentation and Training: Context patches are generated by rectangular masking with random scales/aspect ratios; augmentations are strictly spatial. Two-phase curriculum: warm-up on single-view, then multi-view pretraining with additional alignment contrastives (to 2D image/text embeddings from FG-CLIP).
  • Resulting Representations: Enforcing multi-view consistency with POMA-JEPA yields measurable gains: on SQA3D, EM@1 increases from 50.7% to 51.1%, and on Hypo3D from 32.9% to 33.4%. These effects compound with 2D alignment, improving embodied reasoning (Mao et al., 20 Nov 2025).
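The bidirectional Chamfer objective above can be written compactly over the masked-patch embedding sets; the sketch below (an illustrative implementation, not the authors' code) also demonstrates the permutation invariance the loss is designed to provide:

```python
import numpy as np

def chamfer_latent_loss(z_pred, z_tgt):
    """Bidirectional Chamfer distance between predicted and target latent
    patch embeddings over the masked set Omega_M.
    z_pred, z_tgt: (M, d) arrays of embeddings for the masked patches."""
    z_pred, z_tgt = np.asarray(z_pred), np.asarray(z_tgt)
    # (M, M) matrix of pairwise squared distances between embeddings
    d2 = np.sum((z_pred[:, None, :] - z_tgt[None, :, :]) ** 2, axis=-1)
    # nearest-neighbor match in both directions, summed
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()
```

Because each embedding is matched to its nearest neighbor in the other set, reordering the patches leaves the loss unchanged.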

4. POMA-JEPA for Ortho-Positronium Imaging and Lifetime Tomography

In the positronium imaging context, POMA-JEPA refers to a reconstruction protocol leveraging quantum-electrodynamics-predicted 3γ annihilation kinematics, a high-resolution (~40 ps) plastic-scintillator-based J-PET detector, and analytic trilateration to reconstruct eventwise annihilation points and the o-Ps lifetime on an event-by-event basis (Moskal et al., 2018).

  • Detector and Physical Basis:
    • Plastic scintillator J-PET, multiple concentric cylindrical layers, double-ended PMT/SiPM readout.
    • Coincidence resolving time (CRT) ~140–200 ps.
    • Ortho-positronium decays are sampled from the QED-model 3γ distribution; Compton scattering in the plastic provides both timing and energy signals for all three photons plus the prompt γ.
  • Reconstruction Algorithm:
    • Identify four hits per event: three consistent with Compton interactions of the o-Ps → 3γ photons, and one from the high-energy prompt γ.
    • Rotational transformation to the decay plane: enforce three time–distance equations to solve for the unknowns $(x', y', t)$ per event.
    • Lifetime per event $\tau = t_1 - t_0$, where $t_0$ is the prompt-γ time and $t_1$ is the annihilation time from trilateration.
    • Voxelization: compute the per-voxel arithmetic mean or a maximum-likelihood fit to the decay spectrum over spatial voxels of the imaging volume.
  • Performance:
    • Single-event annihilation time resolution ~80 ps.
    • Mean lifetime precision ~40 ps for $N = 3{,}000$ events/voxel.
    • Imaging can differentiate o-Ps lifetime differences of order 100 ps, sufficient for separating healthy from neoplastic tissue in preclinical/clinical PET (Moskal et al., 2018).
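The three time–distance equations above, $|\mathbf{r}_i - \mathbf{r}| = c\,(t_i - t)$ for the in-plane hits $i = 1..3$, can be solved per event by Newton iteration. The sketch below is an illustrative solver under these assumptions (the cited work uses analytic trilateration; units here are mm and ns):

```python
import numpy as np

C = 299.792458  # speed of light in mm/ns

def trilaterate(hits, t_hits, guess=(0.0, 0.0, 0.0), iters=20):
    """Solve |r_i - r| = c * (t_i - t), i = 1..3, for the in-plane
    annihilation point (x, y) and time t by Newton iteration.
    hits: (3, 2) hit positions in the decay plane; t_hits: (3,) hit times."""
    p = np.array(guess, dtype=float)
    hits = np.asarray(hits, dtype=float)
    t_hits = np.asarray(t_hits, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(hits - p[:2], axis=1)   # photon flight distances
        f = d - C * (t_hits - p[2])                # residuals of the 3 equations
        J = np.column_stack([-(hits[:, 0] - p[0]) / d,   # df/dx
                             -(hits[:, 1] - p[1]) / d,   # df/dy
                             np.full(3, C)])             # df/dt
        p -= np.linalg.solve(J, f)
    return p  # (x, y, t); the event lifetime is then tau = t - t0 (prompt time)
```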

5. Implicit Bias and Practical Design Insights

The core insight governing POMA-JEPA's effectiveness is the implicit bias of latent-prediction self-distillation. In both deep linear theory and practical nonlinear networks, JEPA induces a dynamic ordering of feature learning and converges to a low-rank embedding where high-influence, predictive features are selectively amplified and noise-dominated features are suppressed (Littwin et al., 3 Jul 2024). Practical recommendations include:

  • Use deep encoders ($L \gg 1$) to maximize selectivity.
  • Employ lightweight predictor heads.
  • Prefer standard weight initialization and moderate regularization (weight decay, spectral norm) when needed.
  • Avoid unnecessary contrastive negatives in JEPA-style self-distillation, since StopGrad already yields near-automatic suppression of noise.

Early stopping or aggressive learning-rate schedules are not only sufficient but beneficial for yielding clean, abstract representations. This principle directly underpins POMA-JEPA's downstream label efficiency in molecular property prediction and geometric QA tasks (Piccoli et al., 22 Jun 2025, Mao et al., 20 Nov 2025).
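The EMA target-encoder update underlying these recommendations is a one-line rule, target ← τ·target + (1 − τ)·online, with no gradients through the target branch. A minimal sketch (the τ value is illustrative, not from the cited papers):

```python
def ema_update(target_params, online_params, tau=0.996):
    """EMA target-encoder update used in JEPA-style self-distillation:
    target <- tau * target + (1 - tau) * online. The target branch is
    never backpropagated through (StopGrad), which, together with the
    latent-prediction objective, discourages representational collapse."""
    return [tau * t + (1 - tau) * o
            for t, o in zip(target_params, online_params)]
```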

6. Comparative Summary and Domain-Specific Extensions

POMA-JEPA, as realized in polymer graph SSL, 3D scene representation, and positronium tomography, uniquely positions itself as an alternative to input-space masked autoencoding. Direct comparison to input-space SSL in polymers shows superior $R^2$ and transfer robustness at low label fractions; in 3D, it complements view-to-scene alignment, injecting geometric consistency. In positronium imaging, the analytic inversion and imaging pipeline derived from JEPA-style reasoning achieves lifetime sensitivity unachievable in pure input-reconstruction schemes.

Researchers continue to explore extensions:

  • Chemistry-aware or adaptive patching in polymers.
  • Multimodal fusions (e.g., augmenting graph-based representations with sequence or spectral modalities).
  • Cross-domain transfer in geometric tasks via compounded alignment and prediction losses.
  • Application to total-body PET for clinical spatial-lifetime mapping.

POMA-JEPA thus establishes a unifying paradigm for representation learning in structured and high-dimensional domains, efficiently harnessing the self-distillation bias toward semantics and away from noise (Littwin et al., 3 Jul 2024, Piccoli et al., 22 Jun 2025, Mao et al., 20 Nov 2025, Moskal et al., 2018).
