
Self-Predictive Representations (SPR)

Updated 14 January 2026
  • SPR is a self-supervised framework that learns representations by predicting its own future latent embeddings over multiple steps and action-conditional rollouts.
  • It employs a dual-branch design with online and EMA-updated target networks, stop-gradient operations, and data augmentations to ensure robust learning.
  • Empirical studies show that SPR improves sample efficiency, generalization, and robustness in applications like deep RL, behavioral cloning, and spatial-temporal forecasting.

Self-Predictive Representations (SPR) refer to a class of self-supervised objectives and algorithms in which an agent learns representations by predicting its own future latent representations, generally over multiple steps and under various action-conditional or model-based rollouts. SPR methods have become central to deep reinforcement learning, behavioral cloning, and spatial-temporal prediction, offering principled approaches for encoding temporal and dynamical structure in learned features. SPR frameworks combine theoretical grounding—linked to spectral properties of the Markov transition operator and successor representations—with robust algorithmic mechanisms such as predictor/target networks, stop-gradient dynamics, and data augmentations.

1. Formal Definition and Core Objective

At the heart of SPR is a process that learns an encoder $\phi$ mapping states (or histories) to a latent space $\mathcal{Z}$, and a predictor or transition model $g_\theta$ that, conditioned on the current latent and action, predicts the next-step latent embedding. The canonical loss, in its simplest one-step form, is

$$\mathcal{L}_{\text{SPR}}(\phi, \theta) = \mathbb{E}_{(h, a, h') \sim \mathcal{D}} \left\| g_\theta(\phi(h), a) - \operatorname{sg}[\phi(h')] \right\|_2^2,$$

where $\operatorname{sg}[\cdot]$ denotes stop-gradient, $h$ is a history (or observation), $a$ is an action, $h'$ is the next-step history, and $g_\theta$ is a parametric transition model. In practice, self-prediction may be carried out over multiple steps ($k > 1$), using either recursive models or bootstrapped multi-step heads (Schwarzer et al., 2020, Lawson et al., 11 Jun 2025). Data augmentations, such as pixel shifts or intensity jitter in visual domains, are typically included to regularize representations against spurious information (Schwarzer et al., 2020).
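
A minimal PyTorch sketch of this one-step objective is given below. The encoder, transition, and target_encoder modules are placeholders rather than any specific published implementation, and the normalized (cosine-style) loss follows common practice in place of a raw $\ell_2$ distance; treat it as an illustrative sketch, not the canonical code.

import torch
import torch.nn.functional as F

def spr_loss(encoder, transition, target_encoder, obs, action, next_obs):
    """One-step self-predictive loss: predict the next latent from the current
    latent and action, against a stop-gradient target embedding.
    (Illustrative sketch; full implementations add multi-step rollouts
    and data augmentation.)"""
    z = encoder(obs)                        # online latent  phi(h)
    z_hat = transition(z, action)           # predicted next latent  g_theta(phi(h), a)
    with torch.no_grad():                   # stop-gradient / frozen target branch
        z_target = target_encoder(next_obs)
    # Normalized (cosine) prediction loss, commonly used in place of raw L2
    z_hat = F.normalize(z_hat, dim=-1)
    z_target = F.normalize(z_target, dim=-1)
    return (2 - 2 * (z_hat * z_target).sum(dim=-1)).mean()

The torch.no_grad() block plays the role of $\operatorname{sg}[\cdot]$; in the full two-branch setup the target encoder's weights are an EMA copy of the online encoder (see Section 3).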

A more general formulation, especially in tabular or linearized settings, makes explicit the joint learning of an encoder $\Phi$ and a linear predictor $P$:

$$L(\Phi, P) = \mathbb{E}_{x \sim d,\; y \sim P^\pi(\cdot \mid x)} \left\| P^T \Phi^T x - \Phi^T y \right\|_2^2,$$

where $x$ and $y$ are (one-hot) states and $P^\pi$ is the policy-induced transition kernel (Tang et al., 2022).

2. Theoretical Foundation and Representation Structure

Extensive theoretical analyses reveal that, under idealized optimization dynamics (two-timescale: predictor optimality and semi-gradient update for encoder), SPR objectives induce encoders whose span corresponds to dominant spectral components of the transition kernel $P^\pi$ (Tang et al., 2022, Khetarpal et al., 2024). In particular, with orthonormal initialization and symmetric transition matrices:

  • One-step SPR (BYOL-$\Pi$) dynamics converge to the top-$k$ eigenvectors of $(P^\pi)^2$.
  • Action-conditional SPR (BYOL-AC) recovers the dominant eigenvectors of $|A|^{-1} \sum_a P_a^2$.
  • Bidirectional extensions (learning left/right singular vectors) capture the top singular modes via coupled forward/backward predictors.

SPR features are thus tied to low-rank projections of the MDP's intrinsic structure, often highly correlated with value and Q-function regression tasks (Khetarpal et al., 2024). In the deep setting, properly regularized SPR avoids trivial solutions (collapse to constant representations), a property provably enforced by the use of stop-gradient and momentum updates for the target networks (Ni et al., 2024).
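
To make these spectral characterizations concrete, the NumPy sketch below builds synthetic (approximately) symmetric stochastic per-action transition matrices under a uniform policy and compares the subspaces picked out by the BYOL-$\Pi$ and BYOL-AC characterizations via principal angles. The construction and all names are illustrative; this is not code from the cited works.

import numpy as np

rng = np.random.default_rng(0)
S, A, k = 8, 3, 3   # states, actions, representation dimension

def random_symmetric_stochastic(n, rng):
    """Random symmetric, approximately doubly-stochastic matrix."""
    M = rng.random((n, n))
    M = (M + M.T) / 2
    for _ in range(200):                      # Sinkhorn-style normalization
        M /= M.sum(axis=1, keepdims=True)
        M = (M + M.T) / 2
    return M

P_a = [random_symmetric_stochastic(S, rng) for _ in range(A)]
pi = np.full(A, 1.0 / A)                      # uniform policy for illustration
P_pi = sum(p * P for p, P in zip(pi, P_a))    # policy-averaged transition matrix

def top_k_eigvecs(M, k):
    vals, vecs = np.linalg.eigh(M)            # symmetric matrix -> real spectrum
    return vecs[:, np.argsort(-vals)[:k]]

V_pi = top_k_eigvecs(P_pi @ P_pi, k)                    # BYOL-Pi characterization
V_ac = top_k_eigvecs(sum(P @ P for P in P_a) / A, k)    # BYOL-AC characterization

# Cosines of principal angles between the two k-dimensional subspaces
sv = np.linalg.svd(V_pi.T @ V_ac, compute_uv=False)
print("cosines of principal angles:", np.round(sv, 3))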

3. Algorithmic Design and Training Dynamics

SPR algorithms adhere to a two-branch (online and target) structure:

  • The online encoder and predictor $(\phi, g_\theta)$ consume and process state observations, producing latent predictions.
  • The target encoder, maintained via exponential moving average (EMA) update of online parameters, generates targets for self-prediction while being frozen with respect to the current gradient computation (Schwarzer et al., 2020, Lawson et al., 11 Jun 2025).
  • Multi-step unrolling: recursive application of the transition model/predictor produces rollouts up to horizon $K$, with losses aggregated over steps.
  • Data augmentation may be used to enforce consistency across transformations, leading to auxiliary invariance terms in the objective.

A high-level training loop for pixel-based RL, as in Atari, is shown below (notation adapted from (Schwarzer et al., 2020)):

Initialize online encoder f_θ, target encoder f_θ', transition model h_ϕ, Q-head Q_ψ
Initialize replay buffer B
for each env step:
    observe o_t, act with the Q_ψ policy, store (o_t, a_t, r_t, o_{t+1}) in B
    sample a batch of length-K sequences from B
    ẑ_t ← f_θ(o_t);  L_pred ← 0                # online encoding of the first observation
    for k = 1..K:
        ẑ_{t+k} ← h_ϕ(ẑ_{t+k-1}, a_{t+k-1})    # latent rollout via the transition model
        z'_{t+k} ← f_θ'(o_{t+k})               # target encoding (stop-gradient branch)
        L_pred ← L_pred + ||ẑ_{t+k} - z'_{t+k}||²
    compute consistency loss across augmentations, if augmenting
    total loss = L_RL + λ_SPR · (L_pred + L_consistency)
    update θ, ϕ, ψ by SGD on the total loss
    update θ' ← τ θ' + (1 - τ) θ               # EMA momentum update of the target
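
The EMA step in the last line is typically implemented directly over parameter tensors; a minimal PyTorch sketch (module and coefficient names are illustrative) follows.

import torch

@torch.no_grad()
def ema_update(target_net: torch.nn.Module, online_net: torch.nn.Module, tau: float = 0.99):
    """Exponential-moving-average update of the target network:
    theta' <- tau * theta' + (1 - tau) * theta.
    The target is never updated by gradients, only by this slow copy."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(tau).add_(p_o, alpha=1.0 - tau)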

Tabular and linear-function-approximation analogs resolve the predictor to closed-form at each step, underlining the separation of timescales crucial for non-collapse (Tang et al., 2022).
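
A toy NumPy sketch of this two-timescale scheme in the tabular setting is given below: at each step the predictor is set to its least-squares optimum for the current encoder, and the encoder follows a semi-gradient step that treats the target embedding as a constant. The synthetic $P^\pi$, uniform state distribution, step size, and QR re-orthonormalization are illustrative stand-ins for the idealized analysis, not code from the cited work.

import numpy as np

rng = np.random.default_rng(1)
S, k = 10, 3
P_pi = rng.random((S, S)); P_pi /= P_pi.sum(axis=1, keepdims=True)   # random transition kernel
d = np.full(S, 1.0 / S)                                              # uniform state distribution
D = np.diag(d)

Phi = np.linalg.qr(rng.standard_normal((S, k)))[0]   # orthonormal initial encoder (S x k)
lr = 0.1

for step in range(2000):
    # Fast timescale: predictor set to its least-squares optimum for the current Phi
    P = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ P_pi @ Phi)     # k x k
    # Slow timescale: semi-gradient step on Phi (target embedding Phi^T y held constant)
    grad = 2 * D @ (Phi @ P - P_pi @ Phi) @ P.T
    Phi -= lr * grad
    Phi = np.linalg.qr(Phi)[0]   # re-orthonormalize, mimicking the idealized dynamics

P = np.linalg.solve(Phi.T @ D @ Phi, Phi.T @ D @ P_pi @ Phi)
R = Phi @ P - P_pi @ Phi
print("residual ||D^{1/2}(Phi P - P^pi Phi)||_F^2 =", float(np.trace(R.T @ D @ R)))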

4. Variants: Action-Conditioning, Multi-Step, and Model-Based SPR

Significant advances in SPR formulations arise from making predictors action-conditional, generalizing the loss from a fixed-policy transition to explicit modeling of per-action transitions, e.g.,

$$L_{AC}(\Phi, \{P_a\}) = \mathbb{E}_{x, a, y} \left\| P_a^T \Phi^T x - \operatorname{sg}(\Phi^T y) \right\|^2,$$

with $P_a$ a learned predictor per action (Khetarpal et al., 2024). This modification more accurately matches the structure of control environments and yields features better suited to Q-value approximation.
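
In the deep setting, the per-action predictors $P_a$ can be realized as a bank of linear maps indexed by a discrete action. The PyTorch sketch below is one illustrative way to do this (class and argument names are hypothetical, and a discrete action space is assumed).

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionalPredictor(nn.Module):
    """One linear predictor P_a per discrete action, applied to the latent."""
    def __init__(self, latent_dim: int, num_actions: int):
        super().__init__()
        # Bank of per-action k x k matrices, stored as a single parameter tensor
        self.P = nn.Parameter(torch.randn(num_actions, latent_dim, latent_dim) / latent_dim ** 0.5)

    def forward(self, z: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # z: (batch, k) latents; action: (batch,) long tensor of action indices
        P_a = self.P[action]                           # (batch, k, k)
        return torch.bmm(P_a, z.unsqueeze(-1)).squeeze(-1)

def byol_ac_loss(predictor, z, action, z_next_target):
    """Action-conditional self-prediction loss with a stop-gradient target."""
    z_hat = predictor(z, action)
    return F.mse_loss(z_hat, z_next_target.detach())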

Further, multi-step bootstrapping (e.g., $K \sim 5$) with geometric weighting ($\gamma$) leads the induced representations (BYOL-$\gamma$) to approximate successor features in linear MDPs (Lawson et al., 11 Jun 2025). The solution to the least-squares multi-step self-prediction is (in expectation and under linearity) the successor representation matrix $(I - \gamma P)^{-1}$, which admits a Bellman-consistent solution relevant for model-based planning and robust behavior cloning.
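
The successor-representation connection can be sanity-checked numerically: for any row-stochastic $P$ and $\gamma < 1$, the matrix $M = (I - \gamma P)^{-1}$ satisfies the Bellman-style recursion $M = I + \gamma P M$ and equals the discounted sum of powers of $\gamma P$. A short NumPy check (synthetic $P$, illustrative only):

import numpy as np

rng = np.random.default_rng(2)
S, gamma = 6, 0.9
P = rng.random((S, S)); P /= P.sum(axis=1, keepdims=True)   # random row-stochastic P

M = np.linalg.inv(np.eye(S) - gamma * P)    # successor representation matrix

# Bellman consistency: M = I + gamma * P @ M
print(np.allclose(M, np.eye(S) + gamma * P @ M))            # True

# Equivalently, M is the limit of the discounted sum  sum_{k>=0} (gamma P)^k
approx = sum(np.linalg.matrix_power(gamma * P, i) for i in range(200))
print(np.allclose(M, approx, atol=1e-6))                    # True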

5. Empirical Performance and Practical Properties

Empirical validations in pixel-based RL, behavioral cloning, and spatial-temporal sequence modeling consistently demonstrate that SPR-trained agents exhibit improved sample efficiency, better generalization, and robustness to noise and distractions compared to alternative self-supervised or model-free baselines.

Illustrative results include:

  • On the pixel-based Atari 100k benchmark, SPR achieves a median human-normalized score of 0.415 (mean 0.704), a 55% improvement over prior art, exceeding expert human scores on 7/26 games (Schwarzer et al., 2020).
  • In combinatorial generalization scenarios for robotic behavioral cloning, SPR narrows the out-of-distribution gap by 30–50 absolute percentage points in success rate relative to standard encoders (Lawson et al., 11 Jun 2025).
  • In MuJoCo and MiniGrid RL, SPR outperforms both model-free and observation-predictive methods in sample efficiency and robustness to distraction, with performance further improving as the prediction horizon $K$ increases (Ni et al., 2024).
  • In spatial-temporal forecasting, ST-ReP leverages a joint SPR-based reconstruction/prediction loss, outperforming contrastive and standard self-supervised baselines by 20% in MSE on traffic benchmarks while maintaining scalability (Zheng et al., 2024).

Empirical ablations emphasize the necessity of multi-step prediction, EMA-based target networks, and properly normalized or cosine prediction losses for representation non-collapse and high performance (Schwarzer et al., 2020, Lawson et al., 11 Jun 2025, Tang et al., 2022).

6. Connections to Successor Representations and Other Abstraction Hierarchies

SPR objectives are closely related to well-established concepts in value-predictive and successor-feature learning. When the self-prediction loss is extended over multiple discounted steps, the optimal encoder approximates the successor representation (SR) defined by

$$\psi_{\mathrm{SR}}(s) = \mathbb{E}\!\left[ \sum_{k=1}^{\infty} \gamma^{k-1} \phi(s_{t+k}) \,\middle|\, s_t = s \right],$$

which forms a basis for long-horizon reachability geometry and combinatorial generalization (Lawson et al., 11 Jun 2025). Theoretical work delineates a hierarchy:

$$\text{Observation-predictive} \;\succ\; \text{Self-predictive} \;\succ\; Q^*\text{-irrelevance} \;\succ\; \pi^*\text{-irrelevance}$$

clarifying which abstraction is captured by a particular prediction loss (Ni et al., 2024). Notably, a feature map satisfying next-latent prediction (ZP) plus $Q^*$-irrelevance suffices for reward prediction, establishing a tight connection between model-based and model-free RL (Ni et al., 2024).

7. Algorithmic Guidelines, Limitations, and Extensions

Practitioners are advised to apply self-predictive (ZP) objectives for tasks with noisy, distracting inputs, and to prefer observation-predictive losses in sparse-reward or highly partially observable settings (Ni et al., 2024). Tuning auxiliary loss weights, leveraging EMA targets, and sharing encoders between policy/value heads and self-prediction modules are recommended for robust training.

Major limitations include the sensitivity of training stability to hyperparameter choices (e.g., prediction depth, target momentum, augmentation strength) and architectural bottlenecks (e.g., shallow predictors, linear spatial extractors in ST-ReP) (Schwarzer et al., 2020, Zheng et al., 2024). Extensions to handle long-range temporal contexts, structured graph interactions, or bidirectional/self-inverse modeling are active areas of research (Tang et al., 2022, Zheng et al., 2024).

SPR provides a principled, scalable framework for state, history, and spatio-temporal representation learning, underpinning new advances in data-efficient RL, model-based planning, and generalization in partially observed or highly combinatorial domains (Schwarzer et al., 2020, Khetarpal et al., 2024, Lawson et al., 11 Jun 2025, Zheng et al., 2024).
