Static-Teacher Asymmetric Latent Training (SALT)

Updated 3 July 2026

The paper introduces SALT as a two-stage teacher-student framework where a fixed teacher supplies latent supervision to enhance the student’s performance.
The student model replicates both primary outputs and latent targets, achieving improved accuracy and efficiency through detailed loss functions.
Empirical results demonstrate SALT’s effectiveness in MLIPs, medical imaging, and video learning, outperforming traditional online distillation methods.

Static-teacher Asymmetric Latent Training (SALT) refers to a general framework for two-stage teacher-student training, where a frozen or “static” teacher model provides latent representations as supervision targets to a student model of typically smaller or otherwise optimized architecture. Unlike traditional self-distillation or joint-embedding methods employing an online, EMA-updated teacher, SALT decouples the student’s learning process from the teacher by keeping the teacher’s parameters fixed throughout student training. SALT has been instantiated in domains such as machine learning interatomic potentials (MLIPs) (Matin et al., 7 Feb 2025), medical ultrasound representation learning (Radhachandran et al., 22 Feb 2026), and video self-supervised learning (Li et al., 29 Sep 2025), providing improvements in accuracy, computational efficiency, and interpretability.

1. SALT Objectives and Loss Functions

The SALT framework consistently follows a two-stage structure:

Teacher Stage: A teacher model is trained with access to the complete data signal and supervised to produce high-quality latent representations.
Student Stage: The teacher model is then frozen, and a separate student model is trained to reproduce not only the primary outputs (e.g., energies, forces, or class labels) but also the latent representations (“pseudo-labels”) that the teacher produces for local (e.g., per-atom), patch-level, or block-level targets, depending on the domain.

MLIPs Example (Matin et al., 7 Feb 2025):

Let $\mathcal{D} = \{ (R_i, Z_i) \rightarrow (E_i, F_i) \}$ denote the set of molecular snapshots, where $R$ are the atomic positions, $Z$ the atomic species, $E$ the total energies, and $F$ the forces.

Teacher loss:

$\mathcal{L}_{\rm T} = w_{E} \left( \mathrm{RMSE}(\hat E, E) + \mathrm{MAE}(\hat E, E) \right) + w_{F} \left( \mathrm{RMSE}(\hat F, F) + \mathrm{MAE}(\hat F, F) \right) + w_{L2} \|\Theta\|_{2}^{2} + w_{R} \mathcal{L}_{R}$

Student loss (SALT):

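$\mathcal{L}_{\rm S} = w_{E} \left( \mathrm{RMSE}(\hat E, E) + \mathrm{MAE}(\hat E, E) \right) + w_{F} \left( \mathrm{RMSE}(\hat F, F) + \mathrm{MAE}(\hat F, F) \right) + w_{A} \left( \mathrm{RMSE}(\epsilon^{\mathcal S}, \epsilon^{\mathcal T}) + \mathrm{MAE}(\epsilon^{\mathcal S}, \epsilon^{\mathcal T}) \right) + w_{L2} \|\Theta\|_{2}^{2} + w_{R} \mathcal{L}_{R}$

Here hats denote model predictions, $\Theta$ the model parameters, and $\epsilon^{\mathcal S}$ and $\epsilon^{\mathcal T}$ the per-atom latent targets (e.g., per-atom energies) of the student and the frozen teacher, respectively. The two losses differ only in the $w_{A}$ term, which vanishes for the teacher.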
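The structure of the student loss can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names are hypothetical, forces are assumed to be flattened into a single list of components, and the two regularizers ($w_{L2} \|\Theta\|_{2}^{2}$ and $w_{R} \mathcal{L}_{R}$) are collapsed into one `reg` argument because they depend on the model.

```python
import math

def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def paired_error(pred, target):
    """RMSE + MAE, the paired error applied to each target in the loss."""
    return rmse(pred, target) + mae(pred, target)

def student_loss(E_hat, E, F_hat, F, eps_student, eps_teacher,
                 w_E=1.0, w_F=1.0, w_A=1.0, reg=0.0):
    """Student loss L_S; `reg` is a stand-in (assumption) for the
    model-dependent regularizers w_L2 * ||Theta||^2 + w_R * L_R."""
    return (w_E * paired_error(E_hat, E)
            + w_F * paired_error(F_hat, F)
            + w_A * paired_error(eps_student, eps_teacher)  # latent alignment
            + reg)
```

Setting `w_A=0.0` recovers the form of the teacher loss, since the latent-alignment term is the only difference between the two objectives.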

Masked Latent Prediction in Imaging (Radhachandran et al., 22 Feb 2026, Li et al., 29 Sep 2025):

The student predicts the teacher’s latent representations for masked input regions. The loss is typically an average distance (e.g., Smooth L1 or $\ell_1$ ) between the predicted and static teacher embeddings over masked subsets.
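As a rough illustration of this objective, the sketch below averages a Smooth L1 distance over masked patches only. The function names, the list-of-lists embedding layout, and the `beta` threshold are assumptions made for the example; the cited papers may normalize or weight the loss differently.

```python
def smooth_l1(diff, beta=1.0):
    """Smooth L1: quadratic for |diff| < beta, linear beyond it."""
    a = abs(diff)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta

def masked_latent_loss(pred, teacher, mask, beta=1.0):
    """Mean Smooth L1 between student predictions and frozen teacher latents.

    pred, teacher: per-patch embeddings (lists of lists of floats).
    mask: list of bools, True where the patch is a masked prediction target.
    """
    total, count = 0.0, 0
    for p_vec, t_vec, is_masked in zip(pred, teacher, mask):
        if not is_masked:
            continue  # visible (context) patches contribute no loss
        for p, t in zip(p_vec, t_vec):
            total += smooth_l1(p - t, beta)
            count += 1
    if count == 0:
        raise ValueError("mask selects no patches")
    return total / count
```

Because the teacher is frozen, `teacher` is treated as constant data here; in an autodiff framework no gradient would flow into it.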

2. Architectural and Procedural Details

SALT implementations share key architectural characteristics, emphasizing explicit asymmetry between the teacher and student optimization.

Teacher Model: Trained once with pixel-level, energy-level, or force-level supervision; frozen for all subsequent student training. Teacher’s latent outputs are precomputed (MLIPs) or generated on-the-fly (imaging).
Student Model: Often has reduced parameter count, depth, or receptive field, and is trained solely to minimize the loss relative to both direct outputs and per-target latent representations provided by the teacher.

Representative architectural parameters for MLIPs (Matin et al., 7 Feb 2025):

Model	#Params	Depth × Width	Force RMSE [eV/Å]	Time (μs/step/atom)	Mem. (MB/atom)
Teacher	1.14 M	4 × 128 (atom layers)	0.092	7.8	0.53
Student A	0.29 M	2 × 64 (atom layers)	0.083	5.5	0.35
Ctrl (no SALT)	0.29 M	2 × 64	0.094	5.5	0.35

Vision and Video (US-JEPA, V-JEPA 2) (Radhachandran et al., 22 Feb 2026, Li et al., 29 Sep 2025):

Teacher: ViT-Base/16 or ViT-Large encoders, supervised by masked autoencoding.
Student: Similar or larger ViT with distinct predictor head, trained to reconstruct fixed teacher latents on masked regions.
Masking: Structured region/block-wise masking, typically retaining a high context ratio (e.g., 85–90%) and predicting multiple target regions post-hoc.
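One way to picture block-wise masking at a fixed context ratio is the following sketch, which places rectangular target blocks on a patch grid until the masking budget is used up. The sampler, its parameter names, and the block size are illustrative assumptions and are not taken from the cited papers.

```python
import random

def sample_block_mask(grid_h, grid_w, context_ratio=0.88, block=(2, 2),
                      seed=0, max_tries=1000):
    """Return a grid of bools where True marks a target (masked) patch.

    Blocks are added until at most (1 - context_ratio) of the patches
    are masked, so at least `context_ratio` of the grid stays visible.
    """
    rng = random.Random(seed)
    bh, bw = block
    mask = [[False] * grid_w for _ in range(grid_h)]
    budget = int((1.0 - context_ratio) * grid_h * grid_w)
    masked = 0
    for _ in range(max_tries):
        if masked >= budget:
            break
        top = rng.randrange(grid_h - bh + 1)
        left = rng.randrange(grid_w - bw + 1)
        # Only patches not already masked count against the budget.
        new_cells = [(r, c)
                     for r in range(top, top + bh)
                     for c in range(left, left + bw)
                     if not mask[r][c]]
        if masked + len(new_cells) > budget:
            continue  # this block would exceed the budget; try another
        for r, c in new_cells:
            mask[r][c] = True
        masked += len(new_cells)
    return mask
```

For a 14 × 14 patch grid and a context ratio of 0.88, the budget is 23 masked patches, so the student would predict teacher latents for at most 23 positions per sample.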

3. Asymmetric Optimization and Theoretical Motivation

The central defining feature is the asymmetry: only the student’s parameters are updated during the latent-alignment phase, while the teacher’s latent outputs are fixed.
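The asymmetry can be shown with a deliberately tiny example: a one-parameter "teacher" is frozen, its latent targets are computed once, and gradient descent updates only the student's parameter. Everything here (the scalar models, the plain mean-squared-error objective, the learning rate) is a simplifying assumption for illustration, not a configuration from the cited papers.

```python
def train_student(xs, teacher_weight, steps=200, lr=0.1):
    """Fit a scalar student s(x) = w * x to frozen teacher latents.

    The teacher's latents are computed once and never change; only the
    student weight receives gradient updates.
    """
    targets = [teacher_weight * x for x in xs]  # frozen teacher targets
    student_weight = 0.0
    for _ in range(steps):
        # Gradient of the mean squared latent error w.r.t. the student weight.
        grad = sum(2.0 * (student_weight * x - t) * x
                   for x, t in zip(xs, targets)) / len(xs)
        student_weight -= lr * grad  # the teacher is never updated
    return student_weight, targets

student_w, frozen_targets = train_student([0.5, 1.0, 1.5], teacher_weight=2.0)
```

In contrast, an EMA-based setup would recompute the targets at every step from a teacher whose weights track the student, so the regression target itself moves during training.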

Motivation:

Traditional distillation relies on matching final logits or scalar outputs. SALT augments supervision with $O(N)$ additional pseudo-labels per data sample (e.g. per-atom energies, per-patch representations), greatly increasing the “signal” without inflating annotation budgets.

Theoretical advantage:

This overdetermined constraint reliably guides the student’s intermediate representation to approximate the teacher's local decomposition, improving data efficiency, regularization, and downstream accuracy.

In multi-objective settings (reproducing both global and latent targets), loss weights (e.g., $w_E$, $w_F$, $w_A$) are often annealed on different schedules, enhancing convergence and stability.
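A per-weight schedule of this kind might look like the sketch below. The cosine form, the start and end values, and the direction of each schedule are all hypothetical choices for illustration; the source states only that the weights follow different schedules.

```python
import math

def cosine_anneal(step, total_steps, start, end):
    """Cosine interpolation from `start` at step 0 to `end` at total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return end + 0.5 * (start - end) * (1.0 + math.cos(math.pi * frac))

def loss_weights(step, total_steps):
    """Hypothetical schedules: latent weight decays, energy weight grows."""
    return {
        "w_E": cosine_anneal(step, total_steps, 0.1, 1.0),
        "w_F": 1.0,  # held constant in this sketch
        "w_A": cosine_anneal(step, total_steps, 100.0, 1.0),
    }
```

Calling `loss_weights(step, total_steps)` once per optimizer step and passing the result into the loss keeps the schedule logic separate from the loss definition.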

4. Empirical Results and Comparative Performance

SALT-trained students consistently match or surpass same-size non-SALT controls, and often their own teachers, while requiring less compute and memory.

MLIPs (Matin et al., 7 Feb 2025):
- Students (e.g., (64×2) and (32×3)) achieve lower force RMSE and memory per atom than teachers or equal-capacity controls.
- Rapid convergence and improved Pareto efficiency (accuracy vs. compute/memory).
- In "Born Again" (equal-size student) mode, students can slightly outperform their teachers (force RMSE ~0.083 vs. 0.092 eV/Å).
Medical Ultrasound (Radhachandran et al., 22 Feb 2026):
- On UltraBench, US-JEPA (SALT) achieves state-of-the-art in 5 of 8 major tasks, e.g., 89.2% F1 on breast malignancy, +18% F1 advantage in few-shot benchmarks.
- SALT-trained models demonstrate robustness to noise and degradation (e.g., under severe blur, minimal F1 drop relative to the teacher).
Video Representation (Li et al., 29 Sep 2025):
- At matched pretraining FLOPs, SALT outperforms EMA-based V-JEPA 2 (76.2% vs. 75.3% on SSv2) and exhibits more favorable scaling curves.
- Student quality is robust to the choice and quality of the static teacher; some smaller or partially trained teachers still produce high-performing students.

5. Advantages, Limitations, and Ablation Analyses

Advantages:

Decoupling: Teacher and student can be trained independently and with distinct architectures, enhancing flexibility.
Stability: The use of frozen targets eliminates training collapse and removes the need for EMA hyperparameters or stop-gradient heuristics.
Efficiency: Memory and compute costs are reduced because no live (continuously updated) teacher has to be maintained and gradients are backpropagated through the student network only.

Ablations:

Controls with $w_A = 0$ (no latent loss) reliably underperform those with $w_A > 0$, confirming the importance of latent supervision.
In video and imaging, replacing SALT’s static teacher with an EMA-coupled online teacher increases convergence variance and computational overhead without improving downstream metrics (Radhachandran et al., 22 Feb 2026, Li et al., 29 Sep 2025).
Sensitivity to the latent loss weight $w_A$ exhibits a distinct optimum (around 100 in MLIPs); larger values yield diminishing returns.
Teacher quality matters, but high-performing students can emerge even for sub-optimal teachers, suggesting optimal resource allocation should overwhelmingly favor student capacity (Li et al., 29 Sep 2025).

6. Domain Adaptation and Extensions

SALT has been adapted to diverse domains while retaining its static-teacher design:

Materials Simulation (MLIPs) (Matin et al., 7 Feb 2025): Student models for interatomic potential prediction achieve efficient and accurate simulations in large-scale molecular dynamics.
Medical Imaging (US-JEPA) (Radhachandran et al., 22 Feb 2026): SALT enables robust and transferable ultrasound representations, accommodating high variation and noise.
Video Self-supervision (Li et al., 29 Sep 2025): The approach provides compute-efficient alternatives to online joint-embedding architectures, yielding strong off-the-shelf representations for downstream probing.

A notable extension is the observation that, in both imaging and video, students can be trained with more capacity or compute than the teacher as fixed targets are unbounded in variety and complexity.

7. Practical Considerations and Impact

SALT’s methodological simplicity, rooted in freezing the teacher and regressing to its precomputed or on-the-fly latent targets, eliminates the complexity of online teacher-student coupling and reduces susceptibility to hyperparameter misconfiguration. Loss curves for student SALT training often correlate directly with downstream evaluation (e.g., probing accuracy in video), streamlining model selection and validation.

By achieving Pareto-optimal tradeoffs in accuracy, compute, and memory, SALT is positioned as a scalable, interpretable, and efficient successor to online teacher-student and self-distillation frameworks, with applicability spanning physical simulation, medical imaging, and multimodal video learning (Matin et al., 7 Feb 2025, Radhachandran et al., 22 Feb 2026, Li et al., 29 Sep 2025).