NovaFlow: Zero-Shot Robotic Manipulation
- NovaFlow is an autonomous zero-shot manipulation framework that converts high-level instructions into a dense 3D object flow representation.
- It synthesizes video-based dynamics using pretrained models and off-the-shelf perception to generate actionable plans from language and vision inputs.
- Its embodiment-agnostic design enables transfer across robot platforms without demonstrations or robot-specific retraining, with actions realized through grasp proposals and trajectory optimization.
NovaFlow is an autonomous, demonstration-free, zero-shot manipulation framework that enables robots to execute novel tasks by converting high-level task descriptions directly into actionable plans. Unlike prior techniques that depend on in-distribution tasks or embodiment-matched data for fine-tuning, NovaFlow employs video generation models and off-the-shelf perception modules to synthesize and extract a dense 3D object flow representation. This representation serves as an intermediate abstraction that transfers naturally across different robotic platforms without any demonstrations or robot-specific retraining. The core architecture consists of a Flow Generator, which produces 3D object motion from language instructions and visual observations, and a Flow Executor, which derives robot actions from the object flow and supports both rigid and deformable object manipulation.
1. Framework Structure and Core Modules
NovaFlow's operational pipeline is divided into two principal modules: the Flow Generator and the Flow Executor. The Flow Generator translates natural language instructions (optionally augmented with a goal image), together with an initial RGB-D observation, into a 3D representation of object motion. This module uses large-scale pretrained video generation models to synthesize video sequences that implicitly capture commonsense motion and task progression for the described manipulation. These synthetic dynamics are then distilled into an actionable, spatio-temporal 3D flow.
The Flow Executor ingests the 3D object flow from the generator and computes robot trajectories by converting the intermediate representation into relative object poses. It then employs procedural approaches—such as inverse kinematics (IK), grasp proposals from external models, and trajectory optimization—to realize these poses as actionable robot commands. The entire design decouples high-level task reasoning from low-level execution, facilitating generalized applicability across platforms.
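To make this decoupling concrete, the sketch below outlines the two-module interface in Python. It is a minimal sketch under stated assumptions: the class and method names (ObjectFlow, FlowGenerator.generate_flow, FlowExecutor.execute) are illustrative, not NovaFlow's published API.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class ObjectFlow:
    """Dense 3D object flow: positions of N keypoints over T frames."""
    points: np.ndarray  # shape (T, N, 3), expressed in the camera/world frame

class FlowGenerator:
    """Turns a task description plus an initial RGB-D observation into 3D flow."""
    def generate_flow(self, instruction: str, rgb: np.ndarray,
                      depth: np.ndarray, goal_image=None) -> ObjectFlow:
        # 1. Synthesize a task video with a pretrained I2V/FLF2V model.
        # 2. Lift frames to 3D (monocular depth + intrinsics) and track keypoints.
        # 3. Ground and segment the target object; keep only its points.
        raise NotImplementedError

class FlowExecutor:
    """Converts object flow into commands for a specific robot embodiment."""
    def execute(self, flow: ObjectFlow) -> list:
        # Rigid objects: relative poses (Kabsch) -> grasp -> trajectory opt.
        # Deformable objects: particle dynamics model + MPC flow tracking.
        raise NotImplementedError
```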
2. Video-Based Dynamics Synthesis and 3D Object Flow Extraction
The process initiates with video synthesis. For a given task prompt and initial RGB image (with associated ground-truth depth), NovaFlow invokes a video generation model—either an image-to-video (I2V) or first-last-frame-to-video (FLF2V) model, contingent on goal image availability—to create a video sequence that encapsulates the prescribed manipulation. This video sequence reflects object motion and task evolution derived from large-scale commonsense priors.
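A minimal sketch of this model selection, where i2v_model and flf2v_model are hypothetical wrappers around pretrained generators rather than NovaFlow's actual interfaces:

```python
def synthesize_task_video(prompt, first_frame, goal_frame=None):
    """Pick the video generator based on goal-image availability.

    `i2v_model` and `flf2v_model` are hypothetical wrappers around
    pretrained image-to-video / first-last-frame-to-video models.
    """
    if goal_frame is not None:
        # Goal image available: condition on both endpoint frames (FLF2V).
        return flf2v_model(prompt=prompt, first=first_frame, last=goal_frame)
    # Otherwise condition on the initial frame alone (I2V).
    return i2v_model(prompt=prompt, first=first_frame)
```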
The pipeline then "lifts" the video frames to 3D by applying a monocular depth estimation model, rescaling the estimated depth so that the median depth of the first frame matches the ground-truth measurement. A set of keypoints is uniformly sampled in the first frame; depth and camera intrinsics project these into 3D, and a pretrained tracking module traces their motion across frames, yielding dense per-point 3D trajectories. Object grounding, via open-vocabulary detection and segmentation such as Grounded-SAM2, isolates the points belonging to the target object. The resulting actionable object flow is formalized as

$$\mathcal{F} = \{ P_t \}_{t=1}^{T}, \qquad P_t = \{ p_t^i \in \mathbb{R}^3 \}_{i=1}^{N},$$

where $T$ is the temporal extent and $N$ the number of tracked keypoints.
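A minimal sketch of the lifting step, assuming a pinhole camera model with intrinsics fx, fy, cx, cy, per-frame depth maps from a monocular estimator, and precomputed 2D keypoint tracks; the scale calibration follows the median-matching rule described above:

```python
import numpy as np

def calibrate_depth_scale(est_depth0, gt_depth0):
    """Scale factor aligning the estimated first-frame depth with ground truth."""
    return np.median(gt_depth0) / np.median(est_depth0)

def lift_keypoints(uv, depth, fx, fy, cx, cy, scale=1.0):
    """Back-project 2D keypoints (N, 2) into 3D camera coordinates (N, 3)."""
    z = scale * depth[uv[:, 1].astype(int), uv[:, 0].astype(int)]
    x = (uv[:, 0] - cx) * z / fx
    y = (uv[:, 1] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

# Per-frame usage: track keypoints in 2D, then lift each frame.
# scale = calibrate_depth_scale(est_depth[0], gt_depth0)
# flow = np.stack([lift_keypoints(track_t, est_depth[t], fx, fy, cx, cy, scale)
#                  for t, track_t in enumerate(tracks)])  # shape (T, N, 3)
```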
3. Robot Action Realization for Rigid Objects
For rigid object manipulation, NovaFlow computes a sequence of robot actions through rigid pose estimation and trajectory generation. Starting from the initial keypoints ($\{ p_0^i \}_{i=1}^{N}$) and their tracked positions at time $t$ ($\{ p_t^i \}_{i=1}^{N}$), the framework solves for the optimal rigid transformation, rotation $R_t$ and translation $\mathbf{t}_t$, using the Kabsch algorithm:

$$R_t = \arg\min_{R \in SO(3)} \sum_{i=1}^{N} \left\| R \left( p_0^i - \bar{p}_0 \right) - \left( p_t^i - \bar{p}_t \right) \right\|^2, \qquad \mathbf{t}_t = \bar{p}_t - R_t \, \bar{p}_0,$$

where $\bar{p}_0$ and $\bar{p}_t$ denote the keypoint centroids. The object pose is encoded as a homogeneous transformation matrix

$$T_t = \begin{bmatrix} R_t & \mathbf{t}_t \\ \mathbf{0}^\top & 1 \end{bmatrix} \in SE(3),$$

and a grasp transformation $T_{\text{grasp}}$ is produced by a grasp proposal model. The end-effector pose at time $t$ is then:

$$T_t^{\text{ee}} = T_t \, T_{\text{grasp}}.$$
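The closed-form SVD solution to the Kabsch problem, together with the pose composition above, can be sketched as follows (a standard textbook implementation, not code from NovaFlow itself):

```python
import numpy as np

def kabsch(p0, pt):
    """Best-fit rigid transform mapping keypoints p0 -> pt.

    p0, pt: (N, 3) arrays of corresponding 3D points.
    Returns a 4x4 homogeneous transform T with pt ~= (R @ p0.T).T + t.
    """
    c0, ct = p0.mean(axis=0), pt.mean(axis=0)   # keypoint centroids
    H = (p0 - c0).T @ (pt - ct)                 # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # reflection correction
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = ct - R @ c0
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

# End-effector target at time t: compose object motion with the grasp pose.
# T_ee_t = kabsch(keypoints_0, keypoints_t) @ T_grasp
```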
These 6-DOF poses are converted into robot joint commands via trajectory optimization, employing non-linear least-squares solvers such as Levenberg–Marquardt to enforce smoothness, collision avoidance, and joint limits. This approach directly grounds high-level, object-centric motion into executable trajectories.
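A minimal sketch of this step with SciPy's nonlinear least-squares solver, assuming a placeholder forward-kinematics function fk(q) that returns a 4x4 end-effector pose. Joint limits are enforced through solver bounds (SciPy's trust-region reflective method is used here instead of Levenberg-Marquardt, which does not support bounds) and smoothness through extra residuals; collision terms are omitted for brevity:

```python
import numpy as np
from scipy.optimize import least_squares

def traj_residuals(q_flat, targets, fk, n_joints, w_smooth=0.1):
    """Stacked residuals: per-waypoint pose error plus joint-space smoothness."""
    Q = q_flat.reshape(len(targets), n_joints)
    res = []
    for q, T_target in zip(Q, targets):
        T = fk(q)                                          # forward kinematics
        res.append(T[:3, 3] - T_target[:3, 3])             # position error
        res.append(0.5 * (T[:3, :3] @ T_target[:3, :3].T
                          - np.eye(3)).ravel())            # rotation error
    res.append(w_smooth * np.diff(Q, axis=0).ravel())      # smoothness term
    return np.concatenate(res)

def optimize_trajectory(q0, targets, fk, q_min, q_max):
    """Solve for joint waypoints tracking the 6-DOF end-effector targets."""
    n = len(q0)
    x0 = np.tile(q0, len(targets))
    sol = least_squares(
        traj_residuals, x0, args=(targets, fk, n),
        bounds=(np.tile(q_min, len(targets)), np.tile(q_max, len(targets))))
    return sol.x.reshape(len(targets), n)
```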
4. Model-Based Planning for Deformable Object Manipulation
NovaFlow adapts its plan execution strategy when handling deformable objects by using the dense 3D flow as a tracking reference rather than a direct pose objective. The deformable object is modeled as a set of particles $X_t = \{ x_t^j \}_{j=1}^{M}$, whose evolution under robot actions $a_t$ is predicted by a parameterized particle-based dynamics model $X_{t+1} = f_\theta(X_t, a_t)$. The planning objective is articulated as a cost function:

$$c(X_t) = \sum_{j=1}^{M} \left\| x_t^j - \hat{x}_t^j \right\|^2,$$

where $\hat{x}_t^j$ are the target positions from the object flow. NovaFlow applies model-predictive control (MPC) to minimize cumulative tracking error over a horizon $H$:

$$\min_{a_t, \ldots, a_{t+H-1}} \; \sum_{k=1}^{H} c\!\left( X_{t+k} \right), \qquad X_{t+k} = f_\theta\!\left( X_{t+k-1}, a_{t+k-1} \right).$$
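A minimal random-shooting MPC sketch under these definitions, where dynamics(particles, action) is a stand-in for the learned model $f_\theta$ and flow_targets holds the per-step target positions taken from the object flow:

```python
import numpy as np

def tracking_cost(particles, targets):
    """Sum of squared distances between particles and flow targets."""
    return np.sum((particles - targets) ** 2)

def mpc_plan(particles, flow_targets, dynamics, horizon, n_samples=256,
             action_dim=3, action_scale=0.02, rng=None):
    """Return the first action of the lowest-cost random action sequence.

    `dynamics(particles, action)` stands in for the learned particle-based
    model f_theta; `flow_targets[k]` gives the target particle positions
    k steps ahead, taken from the dense 3D object flow.
    """
    rng = rng or np.random.default_rng()
    best_cost, best_action = np.inf, None
    for _ in range(n_samples):
        seq = action_scale * rng.standard_normal((horizon, action_dim))
        x, cost = particles, 0.0
        for k in range(horizon):
            x = dynamics(x, seq[k])          # roll the dynamics model forward
            cost += tracking_cost(x, flow_targets[k])
        if cost < best_cost:
            best_cost, best_action = cost, seq[0]
    return best_action  # executed, then replanned at the next control step
```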
This process allows the robot to exploit the dense motion cues offered by synthesized video for precise planning even in non-rigid scenarios.
5. Embodiment-Agnostic Transfer and Modularity
NovaFlow's architectural design explicitly decouples task comprehension from low-level motor control by establishing the 3D object flow as a modular, object-centric, and embodiment-agnostic intermediate representation. This abstraction enables any robotic platform, whether a fixed-base manipulator like the Franka arm or a mobile quadruped such as the Spot robot, to interpret the flow and realize corresponding actions using its own controller stack (IK solvers, grasp planners, trajectory optimizers). This modularity removes the need for robot-specific policy learning from demonstrations and allows robust transfer across diverse embodiments and camera configurations, subject to proper hand-eye calibration.
6. Empirical Validation and Benchmarking
NovaFlow was evaluated on multiple tabletop and mobile platforms, including a Franka arm with a Robotiq gripper and the Spot quadruped. Task scenarios encompassed rigid manipulation (hanging a mug, peg-in-hole block insertion, cup-on-saucer placement), articulated motion (drawer opening), deformable manipulation (rope straightening), and linguistically mediated manipulation (plant watering). In quantitative and qualitative assessments, NovaFlow achieved high success rates in zero-shot execution, outperforming zero-shot baselines such as VidBot and AVDC and surpassing demonstration-based imitation learning approaches (e.g., Diffusion Policy and inverse dynamics models trained on small demonstration sets). These results support the efficacy of the flow-based intermediate pipeline and the informational richness of video-derived task abstractions.
7. Applicability, Challenges, and Future Directions
NovaFlow’s methodology is inherently suitable for application domains requiring flexible, scalable object manipulation—such as domestic service robotics, warehouse automation, and industrial process adaptation—without reliance on exhaustive embodiment-specific data. The ability to leverage video generation for commonsense task understanding allows NovaFlow to function in dynamic, unstructured environments.
Nevertheless, several limitations persist. The most frequent operational failures are associated not with flow extraction but with physical execution (grasping errors, trajectory imprecision, collisions, or slippage), typically arising from the open-loop pipeline and the accumulation of execution inaccuracies. Integrating real-time closed-loop feedback, online object tracking, and adaptive replanning is indicated as a promising direction to enhance robustness in scenarios demanding precise physical interaction.
In summary, NovaFlow introduces a new paradigm for robotic manipulation predicated on an intermediate 3D object flow abstraction synthesized from video and language. By decoupling perceptual task reasoning from control and demonstrating transferability and efficacy across platforms and manipulation archetypes, NovaFlow advances the state of zero-shot robotic task acquisition and signals future research toward closed-loop, adaptive manipulation strategies.