Skill Abstraction from Optical Flow

Updated 30 December 2025
  • Skill Abstraction from Optical Flow is a paradigm that learns temporally extended, action-aligned skill representations from large-scale, action-free video data using optical flow.
  • It employs a convolutional-transformer encoder and finite scalar quantization to generate discrete skill tokens, enabling high-level planning and robust execution.
  • SOF leverages optical flow's action-centric focus to reduce dependence on action-labeled data while achieving effective cross-embodiment transfer and strong performance on challenging manipulation and control benchmarks.

Skill Abstraction from Optical Flow (SOF) is a paradigm for learning temporally extended, action-aligned skill representations directly from large-scale, action-free video data. By leveraging optical flow—dense estimates of pixel-level motion between consecutive frames—as a privileged mid-level signal, SOF produces latent skill spaces that capture agent-relevant dynamics. This mid-level abstraction enables high-level planning and robust skill extraction from complex video corpora, facilitating the development of scalable embodied agents with minimal reliance on action supervision (Fang et al., 23 Dec 2025, Bu et al., 20 Nov 2025).

1. Mathematical Formulation and Core Principles

SOF is grounded in the learning of discrete, temporally extended skill tokens derived from segmented optical flow. For a given video $v = (x_1, \ldots, x_T)$:

  • Optical flow is computed as $f_t = \mathrm{FlowEstimator}(x_t, x_{t+1}) \in \mathbb{R}^{H \times W \times 2}$.
  • Subsequences of flows are aggregated over a window of length $H$ to form flow segments $d_{t:t+H-1} = (f_t, \ldots, f_{t+H-1})$.
  • A convolutional-transformer flow encoder $e_\phi$ produces continuous embeddings $z = e_\phi(d_{t:t+H-1}) \in \mathbb{R}^{n \times D}$.
  • Embeddings are quantized into $K$-level discrete tokens using Finite Scalar Quantization (FSQ): $c = \mathrm{FSQ}(z) \in \{1, \ldots, K\}^{n \times D}$.
  • A transformer-based decoder $g_\theta$ reconstructs the flow segment conditioned on the initial RGB frame, producing $\hat{d}_{t:t+H-1} = g_\theta(\mathrm{embed}(c), x_t)$.
  • The primary training objective is the $L_1$ loss between predicted and true flow,

$$\mathcal{L}_{\text{recon}} = \mathbb{E}_{v \sim D_{\mathrm{video}}} \left\lVert \hat{d}_{t:t+H-1} - d_{t:t+H-1} \right\rVert_1 .$$

This skill-tokenization allows for high-level action decision-making at the granularity of human-interpretable motion episodes, supporting both planning and downstream skill composition (Fang et al., 23 Dec 2025).
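
A minimal PyTorch sketch of the FSQ step described above; the function name, the choice of an odd level count K, and the tanh bounding are illustrative assumptions rather than the SOF implementation:

import torch

def fsq_quantize(z, K=5):
    # z: (..., D) continuous embeddings from the flow encoder e_phi.
    # Bound each channel, snap it to K evenly spaced levels, and use a
    # straight-through estimator so the rounding step stays differentiable.
    # K is assumed odd here; even level counts need an extra half-level
    # offset, as in the original FSQ formulation.
    half = (K - 1) / 2.0
    z_bounded = torch.tanh(z) * half                     # values in (-half, half)
    z_rounded = torch.round(z_bounded)                   # nearest of the K levels
    z_q = z_bounded + (z_rounded - z_bounded).detach()   # straight-through gradient
    tokens = (z_rounded + half).long() + 1               # discrete codes in {1, ..., K}
    return z_q / half, tokens

In SOF this quantization is applied per channel of the $n \times D$ embedding $z$, yielding the discrete code grid $c$ consumed by the decoder.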

2. Pipeline Architecture and Training Procedure

The SOF framework is structured in three main stages:

  1. Skill Abstraction Pretraining: The quantized flow autoencoder (encoder, FSQ layer, decoder) is trained on unlabeled video datasets, minimizing flow reconstruction error.
  2. Skill Policy Learning: A policy in the skill-token space is trained to generate skill sequences conditioned on current visual observations and, optionally, natural language instructions. This involves autoregressive prediction via a transformer decoder and cross-entropy loss over token sequences.
  3. Flow2Action Mapping: Skill tokens are decoded into predicted flow segments, which are then translated to low-level robot actions either through:
    • Learning-free heuristics (e.g., inferring object SE(3) transformations from the flow field for tasks such as grasping), or
    • Learning-based methods, in which a lightweight MLP regresses from decoded flow to action vectors using a small labeled action dataset (a minimal sketch of this regressor follows the list).
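
One possible form of the learning-based mapping referenced above; a minimal PyTorch sketch in which the flow-segment shape, action dimensionality, and network sizes are illustrative assumptions, not values from the papers:

import torch.nn as nn

class Flow2Action(nn.Module):
    # Hypothetical Flow2Action regressor: maps a decoded flow segment of H frames
    # at h x w resolution to a chunk of H low-level action vectors.
    def __init__(self, H=8, h=32, w=32, action_dim=7, hidden=256):
        super().__init__()
        self.H, self.action_dim = H, action_dim
        self.net = nn.Sequential(
            nn.Flatten(),                        # (B, H, h, w, 2) -> (B, H*h*w*2)
            nn.Linear(H * h * w * 2, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, H * action_dim),   # one action vector per timestep
        )

    def forward(self, flow_segment):
        out = self.net(flow_segment)
        return out.view(-1, self.H, self.action_dim)

Such a regressor would be fit with a standard L1 or MSE loss against the small labeled action dataset mentioned above.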

A concise sketch of the skill-abstraction pretraining loop is:

# Stage 1: pretrain the quantized flow autoencoder on action-free video.
for batch in video_dataset:
    flows = compute_optical_flows(batch)            # f_t for consecutive frame pairs
    z = encoder(flows)                              # continuous embeddings from e_phi
    skill_tokens = FSQ(z)                           # discrete skill codes c
    reconstructed_flows = decoder(skill_tokens, initial_rgb_frame)  # conditioned on x_t
    optimize(L1_loss(reconstructed_flows, flows))   # minimize L_recon

Skill-policy training and planning are similarly formulated as sequential transformer-based prediction in skill-token space, followed by decoding and action mapping (Fang et al., 23 Dec 2025).
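
A hedged sketch of one skill-policy training step consistent with this description; the policy call signature, teacher forcing, and the flat treatment of the token grid are assumptions:

import torch.nn.functional as F

def skill_policy_step(policy, obs, instruction, skill_tokens, optimizer):
    # `policy` is an autoregressive transformer predicting the next skill token
    # from the visual observation, an optional language instruction, and the
    # previously generated tokens (teacher forcing during training).
    # skill_tokens: (B, n) ground-truth codes from the frozen flow tokenizer,
    # treated here as a flat sequence for simplicity.
    logits = policy(obs, instruction, skill_tokens[:, :-1])   # (B, n-1, vocab)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),                  # (B*(n-1), vocab)
        skill_tokens[:, 1:].reshape(-1),                      # next-token targets
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()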

3. Advantages of Optical Flow for Skill Discovery

Optical flow as a supervisory signal in SOF offers distinct benefits:

  • Action-Centricity: Flow encodes agent-controllable scene regions, highlighting moving objects and suppressing static or irrelevant background, thus filtering out distractors.
  • Stability Under Label Scarcity: Self-supervision via flow constrains latent dynamics, mitigating overfitting and reducing variance in settings with 0–1% action-labeled data. Empirical ablations confirm substantial downstream gains by adding flow constraints to baseline latent-action models (Bu et al., 20 Nov 2025).
  • Scene Adaptivity: Object-centric flow masking (e.g., using LangSAM for segmenting agent-related motion) further isolates agent-induced displacement, ensuring that latent skills correspond to meaningful behaviors.

This enables robust and stable representation learning in real-world and challenging simulated domains such as LIBERO and PROCGEN (Bu et al., 20 Nov 2025).
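
As an illustration of the object-centric flow masking mentioned above, a minimal sketch; segment_agent stands in for any language-conditioned segmenter such as LangSAM, and its interface here is hypothetical:

import numpy as np

def mask_agent_flow(flow, frame, segment_agent, prompt="robot arm"):
    # flow: (H, W, 2) optical flow for one frame pair; frame: the matching RGB image.
    # segment_agent is assumed to return a binary (H, W) mask of agent-related pixels.
    mask = segment_agent(frame, prompt).astype(np.float32)
    return flow * mask[..., None]      # zero out background motion before encoding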

4. High-Level Planning, Skill Composition, and Execution

SOF skill tokens are temporally extended and discretized, suitable for high-level planning:

  • Skill-Policy Deployment: At test time, an autoregressive policy samples or beam-searches skill token sequences conditioned on the current observation and (optionally) text instruction.
  • Skill Execution: Each token directs a sequence of low-level actions, executed in open-loop for HH timesteps, after which replanning can occur for increased adaptivity.
  • Transfer and Composition: The abstraction enables cross-embodiment transfer—skill trajectories learned from one robot arm (e.g., Panda) can be mapped to another (e.g., Sawyer) with minimal adjustment, maintaining skill-semantic consistency and high transfer success rates (Fang et al., 23 Dec 2025).
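
A schematic of the deployment loop implied by these points; all callables are placeholders rather than an API from the cited work:

def run_episode(env, skill_policy, flow_decoder, flow2action, H, instruction=None):
    # Plan in skill-token space, decode to flow, map to actions,
    # execute open-loop for H steps, then replan from the new observation.
    obs = env.reset()
    done = False
    while not done:
        tokens = skill_policy.sample(obs, instruction)    # high-level skill choice
        flow_segment = flow_decoder(tokens, obs)          # predicted motion for H steps
        for action in flow2action(flow_segment)[:H]:      # low-level open-loop execution
            obs, reward, done, info = env.step(action)
            if done:
                break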

5. Quantitative Benchmarks and Empirical Comparisons

SOF and closely related methods such as LAOF have demonstrated superior performance across benchmarks:

Benchmark          Baseline (BC/DP/LAPO)    SOF/LAOF (unsupervised)    SOF/LAOF (action-supervised)
LIBERO-Long        44.7%                    +4.2 pp                    +11.5 pp
PROCGEN-Chaser     0.11                     +0.16                      +0.22
MetaWorld Avg.     0.57 (BC)                0.69                       —

SOF attains 0.69 multi-task average success on MetaWorld (vs. 0.57 for behavioral cloning and 0.14–0.42 for alternative latent policies). In cross-embodiment transfer, zero-shot success exceeds 60% on unseen arms, with nearly identical skill-token traces across robot morphologies. SOF also matches or exceeds baseline policies with roughly one-third of their labeled data (Fang et al., 23 Dec 2025, Bu et al., 20 Nov 2025).

6. Limitations and Open Challenges

SOF presents several open challenges:

  • Flow Estimator Quality: Performance is sensitive to the accuracy of flow estimates. Occlusions, camera shake, and 2D-to-3D ambiguities can degrade skill abstraction; future improvements in flow estimation will benefit SOF pipelines.
  • Small Object and Large Rotation Tasks: Tasks involving fine manipulation or significant part rotation remain challenging, with lower absolute success.
  • Camera Viewpoint Limitations: Fixed camera viewpoints are assumed; multi-view or scene-flow integration is a plausible extension.
  • Action Mapping Heuristics: The learning-free Flow2Action approach requires access to depth and correct segmentation, while learning-based regression reduces assumptions but requires labeled data.
  • Generalization Beyond Third-Person: Adapting SOF to egocentric perspectives or unstructured environments, as well as incorporating modalities beyond vision (e.g., tactile or depth), remains an open problem (Fang et al., 23 Dec 2025, Bu et al., 20 Nov 2025).

7. Significance, Extensions, and Future Directions

SOF demonstrates that mid-level motion representations, specifically temporally extended, discretized optical-flow segments, are a scalable substrate for skill discovery from action-free videos. This abstraction supports:

  • Efficient pretraining on web-scale visual data without requiring dense action annotation,
  • Modular separation of high-level skill planning and low-level action execution,
  • Strong practical performance and transferability across domains and robot morphologies,
  • Promising foundations for large-scale, general-purpose embodied agents.

Open research avenues include improved flow-aware skill segmentation, 3D scene-flow or multi-modal integration, extension to real-world deployments in uncontrolled settings, and minimizing the action annotation requirement for skill-to-action grounding (Fang et al., 23 Dec 2025, Bu et al., 20 Nov 2025).
