- The paper proposes a dual-system architecture in which a diffusion model (System 2) forecasts high-level pixel motion and a mapping module (System 1) converts it into executable robot commands.
- It leverages a universal pixel motion representation computed via self-supervised methods, offering an interpretable and scalable link between vision, language, and robot actions.
- Experimental results in simulation and real-world tasks demonstrate LangToMo’s superior performance over baselines while reducing the need for extensive action trajectory annotations.
The paper "Pixel Motion as Universal Representation for Robot Control" (2505.07817) introduces LangToMo, a vision-language-action framework designed to enable flexible robot control from natural language instructions by using pixel motion as a universal intermediate representation. The core idea is to decouple high-level motion generation from low-level action execution through a dual-system architecture.
Core Concept: Pixel Motion Representation
LangToMo utilizes pixel motion (optical flow between frames) as its universal representation for robot actions. The authors argue that pixel motion is:
- Universal: Agnostic to specific robot embodiments, viewpoints, and tasks.
- Interpretable: Visually intuitive, showing what should move and how.
- Motion-centric: Directly captures the essence of motion required to achieve a goal.
- Scalable: Can be computed with self-supervised optical flow methods such as RAFT (Teed & Deng, 2020) from readily available web-scale video-caption data, bypassing the need for expensive action labels or dense pixel-level annotations (a minimal flow-extraction sketch follows this list).
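Because the flow targets come from an off-the-shelf self-supervised estimator, they can be extracted from raw video with a few lines of code. Below is a minimal sketch using the pre-trained RAFT model shipped with torchvision; the frame tensors and the k-step frame gap are placeholders, and the paper's exact preprocessing may differ.

```python
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

# Pre-trained RAFT: self-supervised flow targets, no action labels required.
weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

def pixel_motion(frame_t: torch.Tensor, frame_tk: torch.Tensor) -> torch.Tensor:
    """Pixel motion between frame_t and frame_{t+k}, both (B, 3, H, W) in [0, 1]
    with H and W divisible by 8.

    Returns a (B, 2, H, W) tensor of per-pixel (dx, dy) displacements, i.e. the
    kind of target System 2 is trained to forecast.
    """
    img1, img2 = preprocess(frame_t, frame_tk)   # normalize inputs as RAFT expects
    with torch.no_grad():
        flow_refinements = raft(img1, img2)      # RAFT returns a list of iterative estimates
    return flow_refinements[-1]                  # keep the final, most refined flow field
```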
Dual-System Architecture (LangToMo)
The framework is structured as a dual-system:
- System 2: Pixel Motion Forecast (High-level Controller)
- This module is a conditional image diffusion model (Ho et al., 2020; Nichol et al., 2021).
- It takes a single image frame (current state), a language instruction (as an embedding), and the pixel motion from the previous time step as input.
- It generates a sequence of pixel motion forecasts representing the desired motion from the current state towards the goal described by the language instruction.
- Forecasting future motion from a single frame and a language instruction is inherently multi-modal (the same instruction and frame can correspond to multiple valid motion sequences), which makes diffusion models well suited to this task.
- System 2 operates at sparse temporal intervals (every k steps), providing high-level guidance.
- The model is trained using a denoising objective on the difference between predicted and ground-truth pixel motion (computed by RAFT) from video-caption pairs. Conditional inputs (image and previous flow) are not noised during training.
- System 1: Pixel Motion to Action Mapping (Low-level Controller)
- This module translates the pixel motion sequence generated by System 2 into executable robot action vectors.
- Action vectors are embodiment-specific, so System 1 is designed as a task-specific mapping function.
- System 1 operates at dense temporal intervals (every j steps, with j < k), allowing precise, reactive control driven by the sparser high-level motion plan (see the control-loop sketch after this list).
- Two instantiations are explored:
- Learned Mapping (LTM-S): A lightweight neural network (vision transformer) is trained on a limited number of expert action trajectories for the specific task. It takes the predicted pixel motion, current state image, and current intermediate state image as input.
- Hand-Crafted Mapping (LTM-H): Leverages the interpretability of pixel motion to create rule-based mappings. This can involve using ground-truth segmentation/depth (in simulation, similar to Ko et al., 2023) or making planar assumptions and using visual geometry (in real-world tabletop tasks, similar to Li et al., 2024) to convert 2D pixel motion into 3D robot movements. This allows for unsupervised control without any action trajectory labels.
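The division of labor between the two systems can be summarized as a simple hierarchical control loop: System 2 is re-queried only every k environment steps, while System 1 maps the latest motion forecast to an action at every step. The sketch below is schematic only; `system2_forecast`, `system1_policy`, `encode_text`, and the `env` interface are hypothetical stand-ins for the paper's components.

```python
import numpy as np

def run_episode(env, instruction, system2_forecast, system1_policy, encode_text,
                k=10, max_steps=200):
    """Hierarchical rollout: System 2 every k steps, System 1 every step.

    The callables are hypothetical stand-ins:
      system2_forecast(image, text_emb, prev_motion) -> (H, W, 2) pixel motion
      system1_policy(motion, image)                  -> robot action vector
      encode_text(instruction)                       -> language embedding
    """
    obs = env.reset()                                        # obs["image"]: (H, W, 3)
    text_emb = encode_text(instruction)
    prev_motion = np.zeros(obs["image"].shape[:2] + (2,))    # zero motion before the first forecast
    motion = prev_motion

    for step in range(max_steps):
        if step % k == 0:
            # System 2 (sparse, high level): diffusion model forecasts the desired
            # pixel motion from the current frame, language, and previous motion.
            motion = system2_forecast(obs["image"], text_emb, prev_motion)
            prev_motion = motion
        # System 1 (dense, low level): embodiment-specific mapping from pixel
        # motion to an action (learned ViT in LTM-S, visual geometry in LTM-H).
        action = system1_policy(motion, obs["image"])
        obs, done = env.step(action)                          # simplified env interface
        if done:
            break
    return obs
```

Keeping System 1 dense and System 2 sparse lets each expensive diffusion forecast amortize over several low-level control steps.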
Practical Implementation Details
- Data: System 2 is pretrained on large-scale video-caption data (a subset of Open X-Embodiment; Open X-Embodiment Collaboration, 2023) and can optionally be fine-tuned on task-specific video data. System 1 (the learned variant) requires a limited number of expert demonstrations (action trajectories) for fine-tuning.
- System 2 Architecture: A modified 2D conditional U-Net (Ronneberger et al., 2015) is used. Its input layers handle 7 channels (3 for the current image, 2 for the previous motion, 2 for the noisy target motion) and its output layers produce 2 channels (the predicted clean motion). Language conditioning is applied via cross-attention using embeddings from the Universal Sentence Encoder (Cer et al., 2018); a configuration sketch follows this list.
- Training: Standard diffusion denoising objective. The previous pixel motion input is sometimes corrupted with noise during training to improve robustness. Zero motion targets are introduced for terminal states to signal task completion.
- Inference: System 2 uses a DDIM scheduler (Song et al., 2020) (e.g., 25 steps) to generate the motion forecast. The predicted motion from one System 2 step is used as the "previous motion" input for the next System 2 step, creating a sequence. System 1 is run multiple times (e.g., 10 steps) for each single motion forecast from System 2.
- Hardware: Pretraining requires significant resources (e.g., 8 A100 GPUs), while fine-tuning and System 1 training are less demanding (e.g., 4 A5000 GPUs or a single A5000).
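To make the channel layout and sampling procedure concrete, the following sketch instantiates a System 2-style model with Hugging Face diffusers. The block sizes, the 512-dimensional sentence embedding, and the sample-prediction setup are assumptions for illustration; the authors' actual U-Net configuration may differ.

```python
import torch
from diffusers import UNet2DConditionModel, DDIMScheduler

# Conditional U-Net: 7 input channels (3 current image + 2 previous motion +
# 2 noisy target motion), 2 output channels (predicted clean motion).
# Block widths and the 512-d cross-attention width are illustrative assumptions.
unet = UNet2DConditionModel(
    sample_size=64,
    in_channels=7,
    out_channels=2,
    cross_attention_dim=512,  # Universal Sentence Encoder embeddings are 512-d
    block_out_channels=(64, 128, 256),
    down_block_types=("CrossAttnDownBlock2D", "CrossAttnDownBlock2D", "DownBlock2D"),
    up_block_types=("UpBlock2D", "CrossAttnUpBlock2D", "CrossAttnUpBlock2D"),
)

# The U-Net is described as predicting the clean motion directly, so "sample"
# prediction is used here instead of the default epsilon prediction.
scheduler = DDIMScheduler(num_train_timesteps=1000, prediction_type="sample")

@torch.no_grad()
def forecast_motion(image, prev_motion, text_emb, num_inference_steps=25):
    """One System 2 call: image (B,3,H,W), prev_motion (B,2,H,W), text_emb (B,1,512)."""
    scheduler.set_timesteps(num_inference_steps)
    motion = torch.randn_like(prev_motion)  # start the target motion from pure noise
    for t in scheduler.timesteps:
        # Conditional inputs are concatenated, not noised; only the target motion is.
        model_in = torch.cat([image, prev_motion, motion], dim=1)  # (B, 7, H, W)
        pred_clean = unet(model_in, t, encoder_hidden_states=text_emb).sample
        motion = scheduler.step(pred_clean, t, motion).prev_sample
    return motion  # reused as the "previous motion" input at the next System 2 step
```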
Experimental Results
LangToMo was evaluated on 15 tasks across simulated (MetaWorld (Yu et al., 2019)) and real-world (xArm tabletop) environments.
- MetaWorld: Both LTM-H (hand-crafted System 1) and LTM-S (learned System 1) achieved strong overall success rates (52.1% and 53.6%, respectively), outperforming several baselines including BC variants, UniPi, and AVDC variants (Ko et al., 2023) (see Table 1). This highlights the effectiveness of the learned pixel motion representation and the hierarchical structure.
- Real-World: LangToMo (LTM-H) demonstrated strong performance on 4 challenging tabletop tasks (see Table 2, Table 3). It outperformed state-of-the-art methods like RT-2 (Brohan et al., 2023) and LLaRA (Li et al., 2024), notably without requiring action trajectory labels during training for System 2. The authors also showed that training System 2 on human demonstrations (not just robot ones) is beneficial (see Table 4 in Appendix).
- Ablations: Studies confirmed the importance of all System 2 inputs (image, language, previous flow). Replacing the diffusion model or altering the conditioning strategy significantly reduced performance. Running System 1 at the same frequency as System 2 or bypassing the intermediate motion representation (training System 1 directly) also led to worse results, validating the hierarchical and two-stage design choices (see Table 5).
Limitations
The paper identifies several practical limitations:
- System 1 Cost: Developing or collecting data for task-specific System 1 mappings (either learned or hand-crafted) can still be costly for each new downstream task.
- 2D Representation: The pixel motion representation is currently 2D and lacks depth information, limiting its applicability to tasks requiring precise 3D reasoning.
- Inference Speed: Diffusion models are computationally expensive at inference time, which might restrict deployment in resource-constrained environments.
- Ego Motion: System 2 training currently assumes fixed-camera videos without ego motion, limiting scalability to arbitrary video data.
Conclusion
LangToMo effectively bridges language, vision, and action by forecasting universal pixel motion representations from video-caption data using diffusion models (System 2) and translating these motions into actions via task-specific mappings (System 1). The framework demonstrates strong performance across various tasks and environments, highlighting the potential of universal, interpretable motion representations for scalable and generalizable robot learning without relying on expensive action trajectory supervision for the core motion generation module. Future work includes addressing the limitations regarding task-specific mappings, 3D representation, computational efficiency, and ego motion handling.