NovaFlow: Zero-Shot Robotic Manipulation

Updated 14 October 2025
  • NovaFlow is an autonomous zero-shot manipulation framework that converts high-level instructions into a dense 3D object flow representation.
  • It synthesizes video-based dynamics using pretrained models and off-the-shelf perception to generate actionable plans from language and vision inputs.
  • Its embodiment-agnostic design enables robots to transfer policies across platforms without demonstrations by leveraging precise trajectory optimization.

NovaFlow is an autonomous, demonstration-free, zero-shot manipulation framework designed to enable robots to execute novel tasks by converting high-level task descriptions directly into actionable plans. Unlike prior techniques that depend on in-distribution tasks or embodiment-matched data for fine-tuning, NovaFlow employs video generation models and off-the-shelf perception modules to synthesize and extract a dense 3D object flow representation. This representation serves as an intermediate abstraction, allowing manipulation policies to transfer naturally across different robotic platforms without demonstrations or robot-specific retraining. The core architecture consists of a Flow Generator, which produces 3D object motion from language instructions and visual observations, and a Flow Executor, which derives robot actions from the object flow, supporting both rigid and deformable object manipulation.

1. Framework Structure and Core Modules

NovaFlow's operational pipeline comprises two principal modules: the Flow Generator and the Flow Executor. The Flow Generator translates natural language instructions (optionally augmented by goal imagery), together with initial RGB-D observations, into a 3D representation of object motion. This module uses large-scale pretrained video generation models to synthesize video sequences that implicitly capture commonsense motion and task progression pertinent to the described manipulation. These synthetic dynamics are then distilled into an actionable, spatio-temporal 3D flow.

The Flow Executor ingests the 3D object flow from the generator and computes robot trajectories by converting the intermediate representation into relative object poses. It then employs procedural approaches—such as inverse kinematics (IK), grasp proposals from external models, and trajectory optimization—to realize these poses as actionable robot commands. The entire design decouples high-level task reasoning from low-level execution, facilitating generalized applicability across platforms.
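
To make this division of labor concrete, the sketch below outlines the two-module structure in Python. The class and method names are illustrative assumptions, not NovaFlow's published API:

```python
# Hypothetical skeleton of the two-module pipeline; names are illustrative.
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectFlow:
    """Dense 3D object flow: M tracked keypoints over T frames."""
    points: np.ndarray  # shape (T, M, 3)

class FlowGenerator:
    """High-level reasoning: language + RGB-D observation -> 3D object flow."""
    def generate(self, instruction: str, rgb: np.ndarray,
                 depth: np.ndarray) -> ObjectFlow:
        # 1) synthesize a task video with a pretrained generative model
        # 2) lift frames to 3D and track object keypoints
        raise NotImplementedError

class FlowExecutor:
    """Low-level execution: object flow -> robot commands (grasping, IK,
    trajectory optimization), chosen per embodiment."""
    def execute(self, flow: ObjectFlow) -> None:
        raise NotImplementedError
```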

2. Video-Based Dynamics Synthesis and 3D Object Flow Extraction

The process initiates with video synthesis. For a given task prompt and initial RGB image (with associated ground-truth depth), NovaFlow invokes a video generation model—either an image-to-video (I2V) or first-last-frame-to-video (FLF2V) model, contingent on goal image availability—to create a video sequence that encapsulates the prescribed manipulation. This video sequence reflects object motion and task evolution derived from large-scale commonsense priors.
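
A hedged sketch of this conditional model choice, with hypothetical `i2v_model` and `flf2v_model` callables standing in for the pretrained generators:

```python
# Hypothetical dispatch between I2V and FLF2V synthesis; the model callables
# are stand-ins, not real APIs.
from typing import Callable, Optional
import numpy as np

def synthesize_task_video(prompt: str,
                          first_frame: np.ndarray,
                          goal_frame: Optional[np.ndarray],
                          i2v_model: Callable,
                          flf2v_model: Callable) -> np.ndarray:
    """Return a synthesized video (T, H, W, 3) depicting the manipulation."""
    if goal_frame is not None:
        # Goal image available: condition on both first and last frames.
        return flf2v_model(prompt, first_frame, goal_frame)
    # Otherwise condition on the first frame only.
    return i2v_model(prompt, first_frame)
```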

The pipeline then “lifts” video frames to 3D by applying a monocular depth estimation model, calibrating depth by matching the median depth of the first estimated frame to the ground-truth RGB-D measurement for scale rectification. A set of keypoints is uniformly sampled in the first frame; depth and camera intrinsics project these into 3D, and a pretrained point-tracking module traces their motion across frames, yielding dense per-point 3D trajectories. Object grounding, via open-vocabulary detection and segmentation models such as Grounded-SAM2, isolates the points belonging to the target object. The resulting actionable object flow is formalized as

$$\mathcal{F} \in \mathbb{R}^{T \times M \times 3}$$

where $T$ is the temporal extent (number of frames) and $M$ the number of tracked keypoints.
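
A hedged sketch of the scale calibration and 3D lifting described above, assuming a standard pinhole camera model (the helper names are ours):

```python
# Median-based scale calibration and pinhole back-projection of keypoints.
import numpy as np

def calibrate_depth(pred_depth: np.ndarray, gt_depth: np.ndarray) -> np.ndarray:
    """Rescale a monocular depth prediction so its median matches the
    ground-truth RGB-D median on the first frame."""
    return pred_depth * (np.median(gt_depth) / np.median(pred_depth))

def lift_keypoints(pixels: np.ndarray, depth: np.ndarray,
                   K: np.ndarray) -> np.ndarray:
    """Back-project pixel keypoints (N, 2) to 3D camera coordinates (N, 3)
    using a calibrated depth map and 3x3 intrinsics K."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = pixels[:, 0], pixels[:, 1]
    z = depth[v.astype(int), u.astype(int)]    # per-keypoint depth lookup
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)
```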

3. Robot Action Realization for Rigid Objects

For rigid object manipulation, NovaFlow computes a sequence of robot actions through rigid pose estimation and trajectory generation. Starting from keypoints $f_i^1$ in the first frame and their tracked positions $f_i^t$, the framework solves for the optimal rigid transformation (rotation $R^t$ and translation $t^t$) using the Kabsch algorithm:

$$R^t = \arg\min_{R \in SO(3)} \sum_i \left\| R \left( f_i^1 - c^1 \right) - \left( f_i^t - c^t \right) \right\|^2$$

$$t^t = c^t - R^t c^1$$

where $c^1$ and $c^t$ denote the keypoint centroids at the first frame and frame $t$, respectively. The object pose is encoded as a homogeneous transformation matrix $T_{\text{obj}}^t$, and a grasp transformation $T_{\text{grasp}}$ is produced by a grasp proposal model. The end-effector pose at time $t$ is then:

$$T_{\text{ee}}^t = T_{\text{obj}}^t \cdot T_{\text{grasp}}$$

These 6-DOF poses are converted into robot joint commands via trajectory optimization, employing non-linear least-squares solvers such as Levenberg–Marquardt to enforce smoothness, collision avoidance, and joint limits. This approach directly grounds high-level, object-centric motion into executable trajectories.
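
A minimal NumPy sketch of the Kabsch step and pose composition above (the helper names are ours; NovaFlow additionally layers grasp proposal, IK, and trajectory optimization on top):

```python
# Kabsch: closed-form rigid registration via SVD. Solves the least-squares
# objective above for R^t and t^t given matched keypoint sets.
import numpy as np

def kabsch(f1: np.ndarray, ft: np.ndarray):
    """f1, ft: (M, 3) initial and tracked keypoints; returns (R, t) with
    ft ~ R @ f1 + t in the least-squares sense."""
    c1, ct = f1.mean(axis=0), ft.mean(axis=0)   # keypoint centroids c^1, c^t
    H = (f1 - c1).T @ (ft - ct)                 # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # optimal rotation in SO(3)
    t = ct - R @ c1                             # t^t = c^t - R^t c^1
    return R, t

def ee_pose(R: np.ndarray, t: np.ndarray, T_grasp: np.ndarray) -> np.ndarray:
    """Compose T_ee^t = T_obj^t @ T_grasp from the recovered object pose."""
    T_obj = np.eye(4)
    T_obj[:3, :3], T_obj[:3, 3] = R, t
    return T_obj @ T_grasp
```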

4. Model-Based Planning for Deformable Object Manipulation

NovaFlow adapts its plan execution strategy when handling deformable objects by using the dense 3D flow as a tracking reference rather than a direct pose objective. The deformable object is modeled as a set of particles $\mathcal{S}_t = \{ s_i^t \}_{i=1}^{N_p}$, whose evolution is predicted by a parameterized particle-based dynamics model $f_\theta$ conditioned on robot actions. The planning objective is a cost function:

$$C(\mathcal{S}_t, \mathcal{F}^t) = \sum_i \left\| s_i^t - f_i^t \right\|^2$$

where $f_i^t$ are the target positions from the object flow. NovaFlow applies model-predictive control (MPC) to minimize the cumulative tracking error over a horizon $H$:

$$A^*_t = \arg\min_{A_t} \sum_{j=t}^{t+H-1} C(\mathcal{S}_j, \mathcal{F}^j) \quad \text{subject to} \quad \mathcal{S}_{j+1} = f_\theta(\mathcal{S}_j, a_j)$$

This process allows the robot to exploit the dense motion cues offered by synthesized video for precise planning even in non-rigid scenarios.
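
For illustration, a simple random-shooting MPC over the flow-tracking cost could look like the sketch below. The uniform action sampling, horizon, and `dynamics` callable are simplifying assumptions; NovaFlow's learned particle dynamics model $f_\theta$ and its planner may be considerably more sophisticated.

```python
# Random-shooting MPC over the flow-tracking cost C(S_t, F^t). Assumes a
# one-to-one correspondence between particles and flow keypoints.
import numpy as np

def tracking_cost(particles: np.ndarray, targets: np.ndarray) -> float:
    """C(S, F) = sum_i || s_i - f_i ||^2, both arrays of shape (Np, 3)."""
    return float(np.sum((particles - targets) ** 2))

def plan_mpc(S0, flow, dynamics, H=5, n_samples=128, action_dim=3, seed=0):
    """Sample H-step action sequences, roll out the particle dynamics
    S_{j+1} = f_theta(S_j, a_j), and keep the lowest-cost sequence."""
    rng = np.random.default_rng(seed)
    best_cost, best_actions = np.inf, None
    for _ in range(n_samples):
        actions = rng.uniform(-1.0, 1.0, size=(H, action_dim))
        S, cost = S0, 0.0
        for j in range(H):
            S = dynamics(S, actions[j])        # predicted next particle state
            cost += tracking_cost(S, flow[j])  # track successive flow targets
        if cost < best_cost:
            best_cost, best_actions = cost, actions
    return best_actions  # execute the first action, then replan (receding horizon)
```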

5. Embodiment-Agnostic Transfer and Modularity

NovaFlow’s architectural design explicitly decouples task comprehension from low-level motor control by establishing the 3D object flow as a modular, object-centric, and embodiment-agnostic intermediate representation. This abstraction enables any robotic platform—be it a fixed-base manipulator like the Franka arm or a mobile quadruped such as the Spot robot—to interpret the flow and realize corresponding actions using an embodiment-appropriate controller stack (IK solvers, grasp planners, trajectory optimizers). This modularity removes the need for robot-specific policy learning from demonstrations and allows robust transfer across diverse embodiments and camera configurations, subject to proper hand–eye calibration.
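
As a small illustration of what this embodiment-agnostic handoff involves, the camera-frame flow can be re-expressed in any robot's base frame given that platform's hand–eye calibration. The function and variable names below are assumptions for the sketch:

```python
# Re-express a camera-frame object flow in a robot's base frame using the
# platform's 4x4 camera extrinsic T_base_cam (from hand-eye calibration).
import numpy as np

def flow_to_base_frame(flow_cam: np.ndarray, T_base_cam: np.ndarray) -> np.ndarray:
    """flow_cam: (T, M, 3) keypoint trajectories in the camera frame;
    returns the same trajectories expressed in the robot base frame."""
    R, t = T_base_cam[:3, :3], T_base_cam[:3, 3]
    return flow_cam @ R.T + t  # broadcasts over all frames and keypoints
```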

6. Empirical Validation and Benchmarking

NovaFlow was evaluated on multiple table-top and mobile platforms, including a Franka arm with a Robotiq gripper and the Spot quadruped. Task scenarios encompassed rigid manipulation (hanging a mug, peg-in-hole block insertion, cup-on-saucer placement), articulated motion (drawer opening), deformable manipulation (rope straightening), and linguistically mediated manipulation (plant watering). In quantitative and qualitative assessments, NovaFlow achieved high zero-shot success rates, outperforming zero-shot baselines such as VidBot and AVDC and surpassing demonstration-conditioned imitation learning approaches (e.g., Diffusion Policy and inverse dynamics models trained on small demonstration sets). These results support the efficacy of the intermediate flow-based pipeline and the informational richness of video-derived task abstraction.

7. Applicability, Challenges, and Future Directions

NovaFlow’s methodology is inherently suitable for application domains requiring flexible, scalable object manipulation—such as domestic service robotics, warehouse automation, and industrial process adaptation—without reliance on exhaustive embodiment-specific data. The ability to leverage video generation for commonsense task understanding allows NovaFlow to function in dynamic, unstructured environments.

Nevertheless, several limitations persist. The most frequent operational failures are associated not with flow extraction but with physical execution (grasping errors, trajectory imprecision, collisions, or slippage), typically arising from the open-loop pipeline and the accumulation of execution inaccuracies. Integrating real-time closed-loop feedback, online object tracking, and adaptive replanning is indicated as a promising direction for enhancing robustness in scenarios demanding precise physical interaction.

In summary, NovaFlow introduces a new paradigm for robotic manipulation predicated on an intermediate 3D object flow abstraction synthesized from video and language. By decoupling perceptual task reasoning from control and demonstrating transferability and efficacy across platforms and manipulation archetypes, NovaFlow advances the state of zero-shot robotic task acquisition and signals future research toward closed-loop, adaptive manipulation strategies.
