Observed Transition Factorization (OTF)
- Observed Transition Factorization is a method that decomposes state transitions into sparse, interpretable primitives, enabling robust structure discovery in high-dimensional and ambiguous environments.
- It uses a two-stage process—primitive extraction and latent action aggregation—to model transitions in both visual dynamical systems and Markov processes.
- Empirical results reveal that OTF improves policy learning, enhances transfer across morphologies and visual modes, and effectively partitions complex networks.
Observed Transition Factorization (OTF) is a methodology for decomposing observed state transitions into interpretable, sparse, and reusable primitives. Developed to identify structured transitions in high-dimensional, ambiguous, or partially observed environments, OTF provides a bottom-up representation of transitions that supports latent action modeling, domain transfer, and policy learning under challenging conditions such as distractors and morphology shifts. The approach has been instantiated both as online matrix factorization for Markov processes on complex networks (Yang et al., 2017) and as latent action inference in visual dynamical systems (Nam et al., 29 Jun 2026).
1. Mathematical Foundation of Observed Transition Factorization
OTF is predicated on the insight that observed transitions, whether in discrete Markovian state spaces or high-dimensional continuous domains (e.g., images), can be approximated by a sparse linear combination of transition primitives. Formally, given observed transitions between consecutive states (in visual domains, between consecutive frames), OTF seeks a dictionary of primitives and nonnegative activation vectors such that each transition is approximately reconstructed from the dictionary weighted by its activations. The canonical factorization loss promotes accurate reconstruction, activation sparsity, and diverse primitive representations (Nam et al., 29 Jun 2026).
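One way to write a factorization of this kind is sketched below. The symbols are assumptions introduced here for illustration and may differ from the notation and exact loss used by Nam et al.: \Delta_t is the observed transition at time t, D = [d_1, ..., d_K] is the dictionary of K primitives, a_t is the nonnegative activation vector, and \lambda_1, \lambda_2 are penalty weights.

```latex
\begin{align}
\Delta_t &\approx D a_t = \sum_{k=1}^{K} a_{t,k}\, d_k, \qquad a_t \ge 0, \\
\mathcal{L}\bigl(D, \{a_t\}\bigr) &= \sum_t \bigl\| \Delta_t - D a_t \bigr\|_2^2
  + \lambda_1 \sum_t \| a_t \|_1
  + \lambda_2 \bigl\| D^\top D - I \bigr\|_F^2 .
\end{align}
```

The three terms correspond, in order, to reconstruction accuracy, activation sparsity, and primitive diversity (here expressed as a soft orthogonality penalty on the dictionary).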
In network analysis, an analogous matrix factorization is posed for an unknown, low-rank Markov operator that is accessed only through observed transitions; the factors are estimated by minimizing an objective under orthogonality constraints (Yang et al., 2017).
2. Algorithmic Frameworks: OTF in World Modeling and Network Analysis
In the latent action modeling context, OTF is structured as a two-stage process:
Stage 1: Primitive Extraction
- Input: Sets of state pairs or transitions.
- Motion-centric input (e.g., frame differences or spatial gradients) is encoded patchwise via a learned encoder, producing one feature vector per patch.
- Quantization: Each patch feature is mapped to its nearest codebook vector by nearest-neighbor search, yielding code assignments and quantized codes.
- Statistical representations: For each code, an occupancy map tracks which patches are assigned to it, and an activation value tracks activation strength.
- A small decoder reconstructs the motion-centric input from these factors, optimizing a combined loss for reconstruction, vector quantization, commitment, and code orthogonality.
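The quantization and statistics steps above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `quantize_patches` is hypothetical, the encoder output is assumed to be given as an array, and defining activation strength as the summed feature norm per code is an assumption made here.

```python
import numpy as np

def quantize_patches(features, codebook):
    """Nearest-neighbor vector quantization of patch features.

    features: (H, W, D) patch features from some encoder (assumed given).
    codebook: (K, D) code vectors.
    Returns code assignments (H, W), quantized features (H, W, D),
    per-code occupancy maps (K, H, W), and per-code activations (K,).
    """
    H, W, D = features.shape
    K = codebook.shape[0]
    flat = features.reshape(-1, D)  # (H*W, D)

    # Squared Euclidean distance from every patch to every code vector.
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = d2.argmin(axis=1)  # index of the nearest code per patch
    quantized = codebook[assign].reshape(H, W, D)

    # Occupancy: one-hot map marking the patches assigned to each code.
    occupancy = np.zeros((K, H * W))
    occupancy[assign, np.arange(H * W)] = 1.0
    occupancy = occupancy.reshape(K, H, W)

    # Activation strength (assumption): summed feature norm over the
    # patches that each code occupies.
    norms = np.linalg.norm(flat, axis=1)
    activation = np.bincount(assign, weights=norms, minlength=K)

    return assign.reshape(H, W), quantized, occupancy, activation
```

Because every patch is assigned to exactly one code, the occupancy maps sum to one at each spatial location, which is what lets a decoder treat them as a spatial partition of the motion signal.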
Stage 2: Latent Action Aggregation (OTF-LAM)
- For each time step, the frozen OTF tokenizer yields codes, occupancy, and activations.
- Each primitive is embedded (state-aware factor embedding) via a learned network.
- Gating: Factors are softly gated, producing a sparse weighted set.
- Aggregation yields a compact latent action via averaging and optional projection.
- Forward Model: A decoder predicts the future state or frame from the current state and the latent action.
- Training minimizes the next-frame prediction error with the OTF tokenizer held fixed.
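The gating and aggregation steps can be illustrated with a short NumPy sketch. The sigmoid gate, its linear parameterization (`gate_weights`, `gate_bias`), and the function name `aggregate_latent_action` are assumptions made here for illustration; the source specifies only soft gating followed by averaging and an optional projection.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def aggregate_latent_action(factor_embeddings, gate_weights, gate_bias=0.0,
                            projection=None):
    """Softly gate primitive embeddings and average them into one latent action.

    factor_embeddings: (K, D) one embedding per primitive.
    gate_weights: (D,) parameters of a hypothetical linear gate.
    projection: optional (P, D) matrix applied after averaging.
    """
    # Soft gate in (0, 1) for each primitive (assumed sigmoid form).
    gates = sigmoid(factor_embeddings @ gate_weights + gate_bias)  # (K,)

    # Gate-weighted average of the embeddings.
    z = (gates[:, None] * factor_embeddings).sum(axis=0) / (gates.sum() + 1e-8)

    if projection is not None:
        z = projection @ z
    return z, gates
```

A forward model would then take the current state together with `z` and predict the next state or frame; with the tokenizer frozen, only the embedding, gating, projection, and forward-model parameters would be trained.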
OTF-LAM-DINO replaces the pixelwise decoder with prediction in a frozen DINOv2 representation space, with the loss computed on predicted versus target features, and so benefits from learned, domain-agnostic visual features (Nam et al., 29 Jun 2026).
For online network factorization, a stochastic generalized Hebbian algorithm updates the factor estimate after each observed transition, using a projection onto the Stiefel manifold. Under proper step-size schedules and spectral gap assumptions, the iterates converge to the principal eigenspace with optimal sample complexity (Yang et al., 2017).
3. Empirical Performance and Transferability
OTF provides substantial empirical improvements in factor reusability, policy learning, and network partitioning.
- Zero-shot transfer: OTF primitives transfer robustly across agent morphologies (e.g., Walker→Cheetah) and across visual modes (e.g., MovingMNIST digit classes), with relative MSE degradation ("drop") on the order of 20–50% (depending on transform), compared to 58–72% for monolithic vector quantization methods. This indicates an improved separation between local, reusable transition effects and global, context-specific templates (Nam et al., 29 Jun 2026).
- Policy learning: In downstream task imitation, OTF-LAM and OTF-LAM-DINO demonstrate competitive or superior average returns versus several baselines under distractors. For example, OTF-LAM-DINO is compared against FLAM-4 and HiLAM on Cheetah-Run, and results are also reported on Walker-Run (Nam et al., 29 Jun 2026).
- Capacity: Increasing the code vocabulary size improves OTF-LAM performance up to a task-dependent point, and OTF-LAM-DINO performance likewise peaks at a particular vocabulary size, so the vocabulary size acts as a tunable capacity rather than a critical hyperparameter.
- Network partitioning: OTF methods recover meaningful city partitions in Manhattan taxi flow, with modularity scores reported alongside tight correspondence with known neighborhoods (Yang et al., 2017).
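For readers unfamiliar with the modularity score mentioned in the network partitioning result, the sketch below computes the standard directed, weighted modularity of a partition from a flow matrix. Whether Yang et al. use exactly this variant is an assumption; the function name `directed_modularity` is introduced here for illustration.

```python
import numpy as np

def directed_modularity(flow, labels):
    """Directed weighted modularity of a partition.

    flow: (n, n) nonnegative matrix, flow[i, j] = weight of transitions i -> j.
    labels: (n,) community label of each node.
    """
    flow = np.asarray(flow, dtype=float)
    labels = np.asarray(labels)
    m = flow.sum()                      # total flow weight
    out_strength = flow.sum(axis=1)     # total outgoing weight per node
    in_strength = flow.sum(axis=0)      # total incoming weight per node

    # Expected flow under a null model that preserves node strengths.
    expected = np.outer(out_strength, in_strength) / m

    # Only pairs of nodes in the same community contribute.
    same = labels[:, None] == labels[None, :]
    return ((flow - expected) * same).sum() / m


# Two disconnected pairs of nodes, partitioned correctly.
flow = np.array([[0, 1, 0, 0],
                 [1, 0, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])
print(directed_modularity(flow, [0, 0, 1, 1]))  # 0.5
```

Placing all four nodes in a single community gives a score of 0, so higher values indicate that more flow stays within communities than the null model predicts.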
4. Exact Recovery and Theoretical Guarantees
OTF admits strong guarantees under Markov process lumpability and spectral separation. Specifically:
- Exact recovery: If the Markov chain's partition is lumpable and a sufficient spectral gap is present, OTF achieves exact block recovery with high probability once sufficiently many samples have been observed (Yang et al., 2017).
- Sample complexity: A finite number of observed transitions suffices to reach a target estimation accuracy with high probability.
- Global convergence: Under properly diminishing step sizes and mixing, stochastic OTF updates converge almost surely to the span of the top eigenvectors of the underlying operator.
This suggests that in both controlled and complex environments, OTF enables interpretable and reliable structure discovery from transition data, without explicit supervision or prior knowledge of state/action semantics.
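The lumpability condition behind the exact recovery guarantee can be checked directly when a transition matrix is known. The sketch below tests strong lumpability in the classical sense: every state in a block must have the same total probability of moving into each block. The function name `is_lumpable` and the tolerance are assumptions made here; in the online setting the transition matrix is not available and this check serves only as an illustration of the condition.

```python
import numpy as np

def is_lumpable(P, labels, tol=1e-9):
    """Check strong lumpability of a Markov chain with respect to a partition.

    P: (n, n) row-stochastic transition matrix.
    labels: (n,) block label of each state.
    """
    P = np.asarray(P, dtype=float)
    labels = np.asarray(labels)
    blocks = np.unique(labels)

    # membership[i, b] = 1 if state i belongs to block b.
    membership = (labels[:, None] == blocks[None, :]).astype(float)

    # block_mass[i, b] = probability of moving from state i into block b.
    block_mass = P @ membership

    # Within each block, all states must share the same row of block_mass.
    for b in blocks:
        rows = block_mass[labels == b]
        if np.abs(rows - rows[0]).max() > tol:
            return False
    return True


P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.7, 0.2],
              [0.4, 0.4, 0.2]])
print(is_lumpable(P, [0, 0, 1]))  # True
print(is_lumpable(P, [0, 1, 1]))  # False
```

In the first partition, states 0 and 1 both move into block {0, 1} with probability 0.8 and into block {2} with probability 0.2, so the aggregated process is itself a Markov chain.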
5. Implementation and Algorithmic Details
OTF is instantiated via distinct but conceptually unified procedures in world modeling and Markov network contexts.
OTF Primitives (World Modeling):
- Training involves an encoder–decoder pipeline with a learned codebook, quantizing patchwise motion or state-difference signals.
- Sparsity is encouraged via penalty terms on the activations; diversity via pairwise orthogonality penalties.
- The full pipeline is optimized end-to-end until convergence, then encoder and codebook are frozen for downstream tasks.
OTF-LAM and OTF-LAM-DINO:
Algorithmic stages include:
- Tokenization of observed motion or transition via the OTF encoder;
- Embedding, gating, and aggregation of primitive activations;
- Prediction of subsequent state or DINO feature via a learned dynamic model.
Pseudocode for each stage is provided in (Nam et al., 29 Jun 2026).
Markov Chain Setting:
- Upon observing a transition, the factor estimate is updated by a projected stochastic gradient step on a nonconvex objective.
- Orthonormality is preserved by projection onto the Stiefel manifold (e.g., via QR factorization).
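A single update of this kind can be sketched as an Oja-style step followed by a QR factorization. This is an illustrative stand-in rather than the exact algorithm of Yang et al.: the symmetrized rank-one sample, the fixed step size passed by the caller, and the function name `online_subspace_update` are all assumptions made here.

```python
import numpy as np

def online_subspace_update(U, i, j, step):
    """One Oja-style subspace update from a single observed transition i -> j.

    U: (n, r) current estimate with orthonormal columns.
    i, j: source and destination state indices of the observed transition.
    step: step size for this update.
    """
    # The transition contributes the rank-one sample
    # A = (e_i e_j^T + e_j e_i^T) / 2 (symmetrization is an assumption).
    # A @ U is nonzero only in rows i and j, so it is built directly.
    G = np.zeros_like(U)
    G[i] += 0.5 * U[j]
    G[j] += 0.5 * U[i]

    # Gradient-like step, then return to orthonormal columns via reduced QR.
    Q, R = np.linalg.qr(U + step * G)

    # Flip column signs so that diag(R) is nonnegative, which keeps the
    # iterate from switching sign arbitrarily between updates.
    signs = np.sign(np.diag(R))
    signs[signs == 0] = 1.0
    return Q * signs
```

In use, `U` would be initialized with random orthonormal columns and updated once per observed transition with a diminishing step size; each update touches only two rows before the QR step, so the sparse part of the work is independent of the number of states.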
6. Applications, Limitations, and Open Questions
OTF is demonstrated in large-scale partitioning of city traffic networks and in visually complex dynamical systems with distractors or ambiguous transition sources.
- Applications: Traffic region discovery, zero-shot motion code transfer, improved latent action policy learning, and partition recovery in large networks.
- Limitations: Success depends on choosing an adequate primitive vocabulary size and on the representational adequacy of the patchwise encoders. Performance can be sensitive to the specified transforms (e.g., velocity vs. acceleration, Sobel vs. gradient filters).
- Open questions: How to optimally select or adapt the vocabulary size for unseen domains; how to guarantee interpretability of primitives under nonlinear context; and how to extend the method to non-Markovian or temporally extended transition factorizations.
7. Connections to Broader Literature
OTF generalizes online matrix factorization paradigms from implicit large-scale networks (Yang et al., 2017) to high-dimensional continuous observation spaces and world modeling for reinforcement learning (Nam et al., 29 Jun 2026).
- Compared to monolithic vector quantization, OTF provides increased transferability and interpretability by imposing spatial, sparse, and orthogonal structuring on primitive codes.
- In comparison to classical spectral clustering, OTF is applicable in settings lacking explicit transition matrices, relying instead on raw transition samples or temporal state observations.
- The OTF-LAM-DINO framework leverages self-supervised vision representation learning (e.g., DINOv2) as a fixed representation space for decoder-free world modeling.
In summary, OTF establishes a scalable, flexible framework for unsupervised structure discovery in dynamically observed systems, bridging graph factorization, visual representation learning, and latent action modeling (Yang et al., 2017, Nam et al., 29 Jun 2026).