HinFlow: Flow-Conditioned Imitation for Robotics

Updated 29 December 2025
  • The paper introduces a hierarchical framework where a high-level flow planner extracts semantically rich subgoals from action-free videos, reducing dependence on expert data.
  • A goal-conditioned low-level imitation policy, paired with an online hindsight relabeling mechanism, ensures dense supervision and stability during training.
  • Empirical results highlight significant performance gains in simulation and real-world tasks, with success rates reaching up to 95% and efficient cross-embodiment transfer.

Hindsight Flow-conditioned Online Imitation (HinFlow) is a hierarchical robot learning framework designed to address the limitations of low-level policy learning when high-quality, labeled robot demonstrations are scarce. HinFlow employs a high-level planner based on 2D point flows, trained on in-the-wild, action-free videos, to propose semantically rich short-horizon subgoals. A low-level, goal-conditioned imitation policy then grounds these flows into executable robot actions. Central to HinFlow is an online, hindsight relabeling mechanism, which annotates collected rollouts with retrospectively achieved goals and efficiently aggregates these relabeled experiences to update the policy via supervised behavioral cloning. This approach yields stable, scalable, and transferable performance in both simulated and real-world manipulation tasks, while dramatically reducing the need for task- and robot-specific expert data (Zheng et al., 22 Dec 2025).

1. Motivation and Hierarchical Decomposition

Standard end-to-end imitation learning for robotics is bottlenecked by the cost and impracticality of collecting extensive, high-quality state-action pairs on physical robots. Large-scale vision-language models, by contrast, exploit vast datasets of unlabeled behavior, a scale of data unattainable in robotics given the logistics of action labeling. HinFlow addresses this challenge by adopting a hierarchical paradigm: it decomposes control into (1) a high-level planner that outputs abstract, temporally extended subgoals ("point flows"), and (2) a low-level policy that is conditioned on these subgoals to synthesize closed-loop actions.

The key insight is that point flow planners can be trained on large, action-free videos, capturing generalizable manipulation intent without requiring robot-specific data. The subsequent challenge becomes reliably grounding these abstract flows in executable robot actions, which HinFlow solves via hindsight relabeling and online policy improvement.

2. Core Algorithmic Components

2.1 State, Action, and Goal Formalization

At timestep $t$, the agent observes a state $s_t = (o_t, p_t)$, where $o_t$ comprises the camera images (third-person and wrist views) and $p_t$ encodes proprioceptive signals such as joint angles. The agent selects an action $a_t$ from a continuous action space, directed toward achieving a high-level goal $g_t$.
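
As a minimal sketch, these quantities can be written as plain Python containers; the field names and shapes below are illustrative assumptions rather than definitions from the paper:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    """Observation s_t = (o_t, p_t): camera images plus proprioception."""
    images: dict         # e.g. {"third_person": HxWx3 array, "wrist": HxWx3 array}
    proprio: np.ndarray  # joint angles / gripper state, shape (D_p,)

@dataclass
class FlowGoal:
    """Short-horizon point-flow subgoal g_t: K points tracked over H future frames."""
    points: np.ndarray   # shape (H, K, 2), pixel coordinates

# An action a_t is a vector in the continuous control space, e.g. end-effector deltas.
Action = np.ndarray
```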

2.2 Point Flows as High-Level Goals

Point flows are multi-point trajectories over a short time horizon in image space. For $K$ tracked image points $\{p_{t,k}\}_{k=1}^K$, a video tracker $\Phi$ extrapolates their paths for $H$ future steps, producing $\{p_{t+i,k}\}_{i=1}^H$. The planner $\mathbf{F}_{\rm flow}$ predicts these flows given the current observation:

$$g_t = \left\{ \hat{p}_{t+1,k}, \ldots, \hat{p}_{t+H,k} \right\}_{k=1}^K = \mathbf{F}_{\rm flow}\left(o_t, \{p_{t,k}\}_{k=1}^K; \xi\right).$$
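
A sketch of how the planner might be queried at inference time; `flow_planner`, `tracker`, and their methods are hypothetical stand-ins for $\mathbf{F}_{\rm flow}$ and $\Phi$, not APIs from the paper:

```python
def propose_flow_goal(flow_planner, tracker, o_t, H=8, K=32):
    """Query the high-level planner for a short-horizon point-flow subgoal g_t."""
    # Choose K query points on the current image (e.g. a grid or detected keypoints);
    # `init_points` is an assumed helper on the tracker.
    p_t = tracker.init_points(o_t, num_points=K)   # (K, 2) pixel coordinates
    # The planner extrapolates each point H steps into the future in image space.
    g_t = flow_planner(o_t, p_t, horizon=H)        # (H, K, 2) predicted flow
    return g_t
```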

2.3 Goal-Conditioned Policy

The low-level policy is modeled as

$$\pi_\theta(a_t \mid s_t, g_t) = \pi_\theta\left(a_t \mid o_t, p_t, \{\hat{p}_{t+i,k}\}\right).$$

This formulation enables direct conditioning on temporally extended visual subgoals.
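
A minimal PyTorch sketch of such a policy: a small visual encoder whose features are concatenated with proprioception and the flattened flow. The architecture and dimensions are assumptions for illustration, not the network used in the paper.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """Illustrative pi_theta(a_t | o_t, p_t, flow): flow-conditioned MLP head."""

    def __init__(self, img_feat_dim=256, proprio_dim=14, H=8, K=32, action_dim=7):
        super().__init__()
        self.encoder = nn.Sequential(              # stand-in visual encoder
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 16, img_feat_dim), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(img_feat_dim + proprio_dim + H * K * 2, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image, proprio, flow):
        z = self.encoder(image)                    # (B, img_feat_dim)
        g = flow.flatten(start_dim=1)              # (B, H*K*2) flattened subgoal
        return self.head(torch.cat([z, proprio, g], dim=-1))  # predicted action
```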

3. Hindsight Relabeling and the Online Loop

After a policy rollout produces an episode $\tau = \{(o_1, a_1), \ldots, (o_T, a_T)\}$, HinFlow applies the tracker $\Phi$ to extract the actually achieved flows $\tilde{g}_t = \Phi(\tau; t)$ for all $t$, producing a hindsight-relabeled dataset. Each tuple $(o_t, a_t, \tilde{g}_t)$ then becomes a valid supervised training example for updating the policy.
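
A sketch of this relabeling step, assuming a rollout stored as (observation, action) pairs and a point tracker exposing a hypothetical `track` method for $\Phi$:

```python
def hindsight_relabel(rollout, tracker, H=8, K=32):
    """Relabel a rollout with the flows it actually achieved.

    Returns (o_t, a_t, g_tilde_t) tuples that can be used directly as
    supervised examples for behavioral cloning.
    """
    frames = [o for o, _ in rollout]
    relabeled = []
    for t, (o_t, a_t) in enumerate(rollout):
        if t + H >= len(rollout):
            break                                  # not enough future frames left
        # Track K points from frame t through the next H frames: the flow the
        # policy actually produced becomes its own goal label.
        g_tilde = tracker.track(frames[t : t + H + 1], num_points=K)  # (H, K, 2)
        relabeled.append((o_t, a_t, g_tilde))
    return relabeled
```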

The online algorithm iterates as follows:

  1. Train the flow planner $\mathbf{F}_{\rm flow}$ on unlabeled videos $\mathcal{D}_h$.
  2. Pretrain the policy $\pi_\theta$ on a small expert-labeled set $\mathcal{D}_a$ using behavioral cloning.
  3. For each episode:
    • Use the planner to compute $g_t$.
    • Roll out the policy for $T$ steps, collecting trajectory $\tau$.
    • For each $t$, relabel with the hindsight goal $\tilde{g}_t = \Phi(\tau; t)$.
    • Add the relabeled tuples to the replay buffer $\mathcal{D}_r$.
    • Sample a minibatch from $\mathcal{D}_r$ and update $\theta$ by minimizing the negative log-likelihood (a code sketch of one such iteration follows this list):
      $$\mathcal{L}(\theta) = \mathbb{E}_{(o_t, a_t, \tilde{g}_t) \sim \mathcal{D}_r} \left[ -\log \pi_\theta(a_t \mid o_t, \tilde{g}_t) \right].$$
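
A compact sketch of one iteration of this loop in PyTorch, reusing the `hindsight_relabel` helper above; the environment, replay buffer, and policy interfaces (`reset`, `step`, `add`, `sample`, `act`, `log_prob`) are illustrative assumptions:

```python
import torch

def online_iteration(env, policy, planner, tracker, replay, optimizer,
                     T=200, batch_size=256):
    """Collect one episode, hindsight-relabel it, and take one BC update step."""
    o_t = env.reset()
    rollout = []
    for _ in range(T):
        g_t = planner.propose(o_t)                 # high-level flow subgoal
        a_t = policy.act(o_t, g_t)                 # low-level grounded action
        rollout.append((o_t, a_t))
        o_t, done = env.step(a_t)
        if done:
            break

    # Hindsight relabeling: the achieved flows become the supervision targets.
    for o, a, g_tilde in hindsight_relabel(rollout, tracker):
        replay.add(o, a, g_tilde)

    # Supervised (behavioral-cloning) update on relabeled experience, i.e. L(theta).
    obs, act, goals = replay.sample(batch_size)
    loss = -policy.log_prob(act, obs, goals).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```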

4. Theoretical Properties and Practical Consequences

Flows serve as low-dimensional, semantically meaningful representations, robust to visual variations such as color and lighting. Hindsight relabeling guarantees dense supervision: every encountered state-action pair is paired with a self-consistent goal, circumventing issues of sparse or deceptive rewards. By re-casting policy improvement as supervised learning, HinFlow avoids instability from exploration and credit assignment that plagues reinforcement learning. The off-policy, aggregation-based training enables continual adaptation as new environment data is collected.

5. Empirical Performance and Comparative Evaluation

HinFlow outperforms prior imitation and reinforcement learning methods on both simulated and real-world manipulation benchmarks. In LIBERO (four tasks) and ManiSkill3 (three tasks), with performance measured as success rate (the fraction of successful episodes), HinFlow attains approximately 84% average success in simulation within 80,000 environment steps, more than twice the success rate of the best static BC or ATM policy and 1.45× that of the strongest competing baseline. On a physical grasp-and-place benchmark, online adaptation improves success from 40% to 95% in around 10,000 steps. ATM (seg) fails to adapt, online-VPT's supervision degrades due to noisy inverse-dynamics labels, and PPO does not exceed 10% success within 80,000 steps despite dense reward signals.

| Method | Simulation Success Rate | Real-Robot Improvement |
|---|---|---|
| HinFlow | ~84% (80K steps) | 40% → 95% (10K steps) |
| ATM (best) | ≤ 42% (static) | Not adaptive |
| PPO | ≤ 10% (80K steps) | Not evaluated |

6. Ablation Studies and Sensitivity Analyses

Experiments varying the number of expert demonstrations (0, 1, or 3 in LIBERO; 0, 2, 5, or 10 in ManiSkill) demonstrate that, while zero demonstrations preclude initial exploration, providing at least one enables HinFlow to recover and approach optimal performance. Final success rates are insensitive to the exact number of expert seeds beyond this minimal threshold. Varying the flow horizon $H$ over $\{4, 8, 12, 16\}$ establishes that very short horizons (e.g., $H = 4$) are insufficient, while horizons of at least eight frames yield stable, high performance.

7. Transferability from Cross-Embodiment Video Data

HinFlow exploits cross-embodiment transfer: flow planners trained on hundreds of action-free videos of, for example, a Franka robot, combined with as few as five Kinova demonstrations, enable low-level policy learning on the target Kinova platform. Empirical results on the "Place Book" task show success improving from 0.6% without cross-embodiment initialization to 48.1% with it, and on "Poke Cube" from 24.4% to 61.3%. This suggests HinFlow can leverage heterogeneous, action-free video data to deliver significant transfer gains with minimal new expert input.


HinFlow integrates hierarchical planning via image-space flows, goal-conditioned imitation policies, and intensive hindsight relabeling in the online loop. This system demonstrates scalable, adaptive robot learning that is robust to demonstration scarcity and capable of transferring manipulation skills across embodiments when equipped with video-only supervision (Zheng et al., 22 Dec 2025).
