Observed Transition Factorization (OTF)

Updated 2 July 2026

Observed Transition Factorization is a method that decomposes state transitions into sparse, interpretable primitives, enabling robust structure discovery in high-dimensional and ambiguous environments.
It uses a two-stage process—primitive extraction and latent action aggregation—to model transitions in both visual dynamical systems and Markov processes.
Empirical results reveal that OTF improves policy learning, enhances transfer across morphologies and visual modes, and effectively partitions complex networks.

Observed Transition Factorization (OTF) is a factorization methodology for decomposing observed state transitions into interpretable, sparse, and reusable primitives. Developed to identify structured transitions in high-dimensional, ambiguous, or partially observed environments, OTF provides a bottom-up representation of transitions, enabling robust latent action modeling, domain transfer, and efficient policy learning under challenging conditions such as distractors and morphology shifts. The approach has been instantiated both as online matrix factorization for Markov processes on complex networks (Yang et al., 2017) and as latent action inference in visual dynamical systems (Nam et al., 29 Jun 2026).

1. Mathematical Foundation of Observed Transition Factorization

OTF is predicated on the insight that observed transitions, whether in discrete Markovian state spaces or high-dimensional continuous domains (e.g., images), can be approximated by a sparse linear combination of transition primitives. Formally, given observed transitions $\Delta s_t = s_{t+1} - s_t$ or, in visual domains, $\Delta x_t = x_{t+\tau} - x_t$, OTF seeks a dictionary $\{p_k\}_{k=1}^K$ and nonnegative activations $\alpha_{t,k}$ (collected per transition in a vector $\alpha_t$) satisfying

$\Delta s_t \approx \sum_{k=1}^K \alpha_{t,k}\,p_k, \qquad \alpha_{t,k}\ge 0,\quad \|\alpha_t\|_0\ll K,$

with a canonical factorization loss

$\min_{p_{1:K},\{\alpha_t\}} \sum_t \Big\|\Delta s_t-\sum_{k=1}^K\alpha_{t,k}\,p_k\Big\|_2^2 + \lambda_{1}\sum_t\|\alpha_t\|_{1} + \lambda_{2}\sum_{k\neq k'}\bigl\langle p_k,p_{k'}\bigr\rangle^2,$

promoting accurate reconstruction, activation sparsity, and diverse primitive representations (Nam et al., 29 Jun 2026).
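As a minimal sketch of how the three terms of this objective can be evaluated, the NumPy function below computes reconstruction error, the $\ell_1$ activation penalty, and the pairwise primitive penalty for given factors. The function name `otf_loss`, the array layout, and the default values of `lam1` and `lam2` are illustrative assumptions, not taken from the source.

```python
import numpy as np

def otf_loss(deltas, primitives, alphas, lam1=0.1, lam2=0.01):
    """Evaluate the three-term factorization objective (illustrative layout).

    deltas:     (T, D) observed transitions, one per row
    primitives: (K, D) dictionary, one primitive p_k per row
    alphas:     (T, K) nonnegative activations alpha_{t,k}
    """
    recon = alphas @ primitives                    # sum_k alpha_{t,k} p_k for each t
    recon_term = np.sum((deltas - recon) ** 2)     # squared L2 reconstruction error
    sparsity_term = lam1 * np.sum(np.abs(alphas))  # L1 penalty on activations
    gram = primitives @ primitives.T               # inner products <p_k, p_k'>
    off_diag = gram - np.diag(np.diag(gram))       # keep only the k != k' entries
    diversity_term = lam2 * np.sum(off_diag ** 2)  # penalize overlapping primitives
    return recon_term + sparsity_term + diversity_term

# Toy check: three orthonormal primitives, two exactly reconstructed transitions.
primitives = np.eye(3)
alphas = np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
deltas = alphas @ primitives
loss = otf_loss(deltas, primitives, alphas, lam1=0.1, lam2=0.01)
print(loss)  # only the L1 term contributes here: lam1 * (1 + 2)
```

With orthogonal primitives and exact reconstruction, the first and third terms vanish, so the remaining value isolates the cost of activation mass.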

In network analysis, an analogous matrix factorization is posed for an unknown, low-rank Markov operator $P \in \mathbb{R}^{d \times d}$ observed only through transitions $(i_t \to j_t)$:

$P \approx X X^\top, \qquad X \in \mathbb{R}^{d \times k},\quad X^\top X = I_k,$

where $X$ minimizes $\|P - X X^\top\|_F^2$ under the orthogonality constraint (Yang et al., 2017).
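For intuition about this objective when $P$ is fully known, note that with $X^\top X = I_k$ one has $\|P - XX^\top\|_F^2 = \|P\|_F^2 - 2\,\mathrm{tr}(X^\top P X) + k$, so a minimizer is given by the top-$k$ eigenvectors of the symmetric part of $P$. The sketch below illustrates this batch solution on a toy block matrix; it is not the online procedure of Yang et al., and the helper names are assumptions made for illustration.

```python
import numpy as np

def best_orthonormal_factor(P, k):
    """Top-k eigenvectors of the symmetric part of P (batch solution, known P)."""
    S = 0.5 * (P + P.T)
    eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
    return eigvecs[:, -k:]                # columns for the k largest eigenvalues

def objective(P, X):
    """Squared Frobenius error ||P - X X^T||_F^2."""
    return np.linalg.norm(P - X @ X.T, "fro") ** 2

# Toy operator with two blocks of two states each.
P = np.array([
    [0.5, 0.5, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5, 0.5],
])
X = best_orthonormal_factor(P, k=2)
err = objective(P, X)
print(np.round(X.T @ X, 6))  # orthonormal columns
print(err)                   # numerically zero for this toy P
```

This toy $P$ is itself an orthogonal projector of rank 2, which is why the error vanishes; a general Markov operator would leave a nonzero residual.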

2. Algorithmic Frameworks: OTF in World Modeling and Network Analysis

In the latent action modeling context, OTF is structured as a two-stage process:

Stage 1: Primitive Extraction

Input: Pairs of states $(x_t, x_{t+\tau})$ or the corresponding transitions $\Delta x_t = x_{t+\tau} - x_t$.
Encoding: A motion-centric transform of the input (e.g., frame differences or spatial gradients) is encoded patchwise by a learned encoder, producing one feature per patch.
Quantization: Each patch feature is mapped to its nearest codebook vector, yielding a code assignment and a quantized code for every patch.
Statistical representations: For each code $k$, an occupancy map records which spatial patches are assigned to it, and an activation value records its strength.
Reconstruction: A small decoder reconstructs the signal from these factors, trained with a combined loss covering reconstruction, vector quantization, commitment, and code orthogonality.
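The quantization and statistics steps of Stage 1 can be sketched as follows. This is a hedged illustration only: the encoder is omitted (patch features are passed in directly), occupancy is taken to be a binary mask per code, and activation strength is assumed to be the summed feature norm of the patches assigned to each code. None of these choices, nor the function name `quantize_patches`, is confirmed by the source.

```python
import numpy as np

def quantize_patches(features, codebook):
    """Nearest-neighbor quantization of patch features (illustrative sketch).

    features: (H, W, D) patch features from some encoder
    codebook: (K, D) code vectors
    """
    H, W, D = features.shape
    K = codebook.shape[0]
    flat = features.reshape(-1, D)                                   # (H*W, D)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = d2.argmin(axis=1)                                        # code index per patch
    quantized = codebook[codes].reshape(H, W, D)                     # quantized codes
    # Occupancy map per code: 1 where a patch is assigned to that code (assumption).
    occupancy = np.zeros((K, H, W))
    idx = np.arange(H * W)
    occupancy[codes, idx // W, idx % W] = 1.0
    # Activation strength per code: summed feature norm of its patches (assumption).
    activation = np.bincount(codes, weights=np.linalg.norm(flat, axis=1), minlength=K)
    return codes.reshape(H, W), quantized, occupancy, activation

codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
features = np.array([
    [[0.9, 0.1], [0.0, 0.1]],
    [[0.1, 1.2], [1.1, 0.0]],
])
codes, quantized, occupancy, activation = quantize_patches(features, codebook)
print(codes)         # code assignment for each of the 2x2 patches
print(occupancy[1])  # spatial mask of patches assigned to code 1
```

Because every patch receives exactly one code, the occupancy maps sum to one at each spatial location.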

Stage 2: Latent Action Aggregation (OTF-LAM)

Tokenization: For each time step $t$, the frozen OTF tokenizer yields codes, occupancy maps, and activations.
Embedding: Each primitive is embedded by a state-aware factor-embedding network.
Gating: Factors are softly gated, producing a sparse weighted set.
Aggregation: Averaging the gated factors, with an optional projection, yields a compact latent action.
Forward model: A decoder predicts the future state or frame conditioned on the latent action.
Training minimizes the next-frame prediction error with the OTF tokenizer held fixed.
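The gating and aggregation steps above can be sketched as follows. The exact gate used in the source is not recoverable here, so this example assumes a sigmoid gate on a linear score, masks out codes with zero activation, and forms a gate-weighted average; the function and parameter names (`aggregate_latent_action`, `w_gate`, `proj`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def aggregate_latent_action(factor_emb, activation, w_gate, b_gate=0.0, proj=None):
    """Soft gating followed by weighted averaging (assumed form, not the paper's).

    factor_emb: (K, E) one embedding per primitive
    activation: (K,)   activation strength per primitive from the tokenizer
    w_gate:     (E,)   weights of a linear gate score
    proj:       optional (Z, E) projection applied to the averaged embedding
    """
    active = activation > 0                                # inactive codes get zero weight
    gates = sigmoid(factor_emb @ w_gate + b_gate) * active
    weights = gates / max(gates.sum(), 1e-8)               # normalize to a weighted average
    z = weights @ factor_emb                               # compact latent action
    if proj is not None:
        z = proj @ z
    return z, gates

factor_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
activation = np.array([2.0, 0.0, 1.0])
z, gates = aggregate_latent_action(factor_emb, activation, w_gate=np.zeros(2))
print(gates)  # the inactive second factor is gated to zero
print(z)      # average of the two active embeddings
```

With a zero gate vector every active factor receives the same gate value, so the latent action reduces to a plain average over active primitives.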

OTF-LAM-DINO replaces the pixelwise decoder with prediction in a frozen DINOv2 representation space, with the prediction loss computed in that feature space, benefiting from learned, domain-agnostic visual features (Nam et al., 29 Jun 2026).

For online network factorization, a stochastic generalized Hebbian algorithm updates the factor $X$ after each observed transition and projects the result back onto the Stiefel manifold. Under suitable step-size schedules and spectral gap assumptions, the iterates converge to the principal eigenspace with optimal sample complexity (Yang et al., 2017).
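A minimal sketch of one such online update is shown below, assuming an Oja-style step in which each observed transition $i \to j$ contributes the symmetrized rank-one sample $\tfrac{1}{2}(e_i e_j^\top + e_j e_i^\top)$ and orthonormality of the $d \times k$ factor $X$ is restored by a QR factorization. The sample construction, the step-size schedule, and the toy chain are assumptions for illustration and may differ from the update analyzed by Yang et al.

```python
import numpy as np

def hebbian_step(X, i, j, step):
    """One Oja-style update from a single observed transition i -> j (assumed form)."""
    # Apply the sample A = (e_i e_j^T + e_j e_i^T) / 2 to X without building A.
    AX = np.zeros_like(X)
    AX[i] += 0.5 * X[j]
    AX[j] += 0.5 * X[i]
    # Retract onto the Stiefel manifold via reduced QR.
    Q, R = np.linalg.qr(X + step * AX)
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs  # fix column signs so the result does not flip arbitrarily

# Toy chain: 6 states in two blocks, mostly staying inside the current block.
rng = np.random.default_rng(0)
d, k = 6, 2
labels = np.array([0, 0, 0, 1, 1, 1])
same = labels[:, None] == labels[None, :]
P = np.where(same, 0.9 / 3, 0.1 / 3)  # each row sums to 1

X = np.linalg.qr(rng.standard_normal((d, k)))[0]  # random orthonormal start
state = 0
for t in range(1, 2001):
    nxt = rng.choice(d, p=P[state])
    X = hebbian_step(X, state, nxt, step=1.0 / (10 + t))  # diminishing step (assumed)
    state = nxt

print(np.round(X.T @ X, 6))  # columns stay orthonormal after every update
```

Only two rows of $X$ enter each update before the retraction, which is what makes per-transition processing cheap relative to forming the full $d \times d$ operator.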

3. Empirical Performance and Transferability

OTF provides substantial empirical improvements in factor reusability, policy learning, and network partitioning.

Zero-shot transfer: OTF primitives transfer robustly across agent morphologies (e.g., Walker→Cheetah) and across visual modes (e.g., MovingMNIST digit classes), with relative MSE degradation ("drop") on the order of 20–50% (depending on transform), compared to 58–72% for monolithic vector quantization methods. This indicates an improved separation between local, reusable transition effects and global, context-specific templates (Nam et al., 29 Jun 2026).
Policy learning: In downstream task imitation, OTF-LAM and OTF-LAM-DINO demonstrate competitive or superior average returns versus several baselines under distractors. For example, OTF-LAM-DINO is compared on Cheetah-Run against FLAM-4 and HiLAM, and is also evaluated on Walker-Run (Nam et al., 29 Jun 2026).
Capacity: Increasing the code vocabulary size $K$ improves OTF-LAM performance up to a task-specific point, while OTF-LAM-DINO peaks around a specific vocabulary size, suggesting that $K$ acts as a tunable capacity rather than a critical hyperparameter.
Network partitioning: OTF methods recover meaningful city partitions in Manhattan taxi flow, with modularity scores reported in the source and tight correspondence with known neighborhoods (Yang et al., 2017).
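To make the modularity criterion in the network-partitioning result concrete, the sketch below computes the standard modularity of a partition for a directed, weighted flow matrix, $Q = \frac{1}{m}\sum_{ij}\bigl[A_{ij} - k_i^{\text{out}} k_j^{\text{in}}/m\bigr]\,\delta(c_i, c_j)$. Whether Yang et al. use exactly this variant is an assumption, and the flow matrix here is synthetic rather than taxi data.

```python
import numpy as np

def directed_modularity(flow, labels):
    """Modularity of a partition for a directed, weighted flow matrix."""
    flow = np.asarray(flow, dtype=float)
    labels = np.asarray(labels)
    m = flow.sum()                             # total flow
    k_out = flow.sum(axis=1)                   # outgoing flow per node
    k_in = flow.sum(axis=0)                    # incoming flow per node
    same = labels[:, None] == labels[None, :]  # delta(c_i, c_j)
    return float(((flow - np.outer(k_out, k_in) / m) * same).sum() / m)

# Synthetic flows: two pairs of zones that only exchange trips internally.
flow = np.array([
    [0, 5, 0, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 5],
    [0, 0, 5, 0],
])
q_good = directed_modularity(flow, [0, 0, 1, 1])  # partition matching the flows
q_trivial = directed_modularity(flow, [0, 0, 0, 0])  # everything in one block
print(q_good, q_trivial)
```

A partition that matches the flow structure scores strictly higher than the trivial single-block partition, whose modularity is zero by construction.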

4. Exact Recovery and Theoretical Guarantees

OTF admits strong guarantees under Markov process lumpability and spectral separation. Specifically:

Exact recovery: If the Markov chain's partition is lumpable and a sufficient spectral gap is present, OTF achieves exact block recovery with high probability once enough samples have been observed (Yang et al., 2017).
Sample complexity: A finite number of observed transitions suffices to reach a target estimation accuracy with high probability; the explicit bound is given in Yang et al. (2017).
Global convergence: Under properly diminishing step sizes and mixing, stochastic OTF updates converge almost surely to the span of the top eigenvectors of $P$.
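The lumpability condition behind the exact-recovery guarantee can be checked directly when a transition matrix is available. The sketch below uses the classical (strong) lumpability criterion: a partition is lumpable if all states in the same block have identical total transition probability into every block. The function name and tolerance are illustrative, and the source may rely on a related but different formulation.

```python
import numpy as np

def is_lumpable(P, labels, tol=1e-9):
    """Check strong lumpability of a partition for a row-stochastic matrix P."""
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    blocks = np.unique(labels)
    # mass[i, b] = probability of moving from state i into block b
    mass = np.stack([P[:, labels == b].sum(axis=1) for b in blocks], axis=1)
    for b in blocks:
        rows = mass[labels == b]
        # All states in block b must agree on their block-level transition row.
        if not np.allclose(rows, rows[0], rtol=0.0, atol=tol):
            return False
    return True

P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.7, 0.2],
    [0.4, 0.4, 0.2],
])
print(is_lumpable(P, [0, 0, 1]))  # states 0 and 1 both send 0.8 / 0.2 to the blocks
print(is_lumpable(P, [0, 1, 1]))  # states 1 and 2 disagree on mass sent to block 0
```

When the check passes, the block-level rows in `mass` define a smaller Markov chain over blocks, which is the structure that block recovery aims to identify.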

This suggests that in both controlled and complex environments, OTF enables interpretable and reliable structure discovery from transition data, without explicit supervision or prior knowledge of state/action semantics.

5. Implementation and Algorithmic Details

OTF is instantiated via distinct but conceptually unified procedures in world modeling and Markov network contexts.

OTF Primitives (World Modeling):

Training involves an encoder–decoder pipeline with a learned codebook, quantizing patchwise motion or state-difference signals.
Sparsity is encoded via $\ell_1$ loss terms on activations; diversity via pairwise orthogonality penalties.
The full pipeline is optimized end-to-end until convergence, then encoder and codebook are frozen for downstream tasks.

OTF-LAM and OTF-LAM-DINO:

Algorithmic stages include:

Tokenization of observed motion or transition via the OTF encoder;
Embedding, gating, and aggregation of primitive activations;
Prediction of subsequent state or DINO feature via a learned dynamic model.

Pseudocode for each stage is provided in Nam et al. (29 Jun 2026).

Markov Chain Setting:

Upon observing a transition $(i_t \to j_t)$, the factor $X$ is updated by a projected stochastic gradient step on the nonconvex objective.
Orthonormality is preserved by projection onto the Stiefel manifold (e.g., via QR factorization).

6. Applications, Limitations, and Open Questions

OTF is demonstrated in large-scale partitioning of city traffic networks and in visually complex dynamical systems with distractors or ambiguous transition sources.

Applications: Traffic region discovery, zero-shot motion code transfer, improved latent action policy learning, and partition recovery in large networks.
Limitations: Success depends on the adequacy of the primitive vocabulary size $K$ and the representational adequacy of patchwise encoders. Performance can be sensitive to the specified transforms (e.g., velocity vs. acceleration, Sobel vs. gradient filters).
Open questions: How to optimally select or adapt $K$ for unseen domains; how to guarantee interpretability of primitives under nonlinear context; and how to extend the approach to non-Markovian or temporally extended transition factorizations.

7. Connections to Broader Literature

OTF generalizes online matrix factorization paradigms from implicit large-scale networks (Yang et al., 2017) to high-dimensional continuous observation spaces and world modeling for reinforcement learning (Nam et al., 29 Jun 2026).

Compared to monolithic vector quantization, OTF provides increased transferability and interpretability by imposing spatial, sparse, and orthogonal structuring on primitive codes.
In comparison to classical spectral clustering, OTF is applicable in settings lacking explicit transition matrices, relying instead on raw transition samples or temporal state observations.
The OTF-LAM-DINO framework leverages self-supervised vision representation learning (e.g., DINOv2) as a fixed representation space for decoder-free world modeling.

In summary, OTF establishes a scalable, flexible framework for unsupervised structure discovery in dynamically observed systems, bridging graph factorization, visual representation learning, and latent action modeling (Yang et al., 2017, Nam et al., 29 Jun 2026).

References

Yang et al. (2017). Online Factorization and Partition of Complex Networks From Random Walks.

Nam et al. (2026). Latent Actions from Factorized Transition Effects under Agent Ambiguity.
