Learning by Watching: A Review of Video-based Learning Approaches for Robot Manipulation
Abstract: Robot learning of manipulation skills is hindered by the scarcity of diverse, unbiased datasets. While curated datasets can help, challenges remain in generalizability and real-world transfer. Meanwhile, large-scale "in-the-wild" video datasets have driven progress in computer vision through self-supervised techniques. Translating this to robotics, recent works have explored learning manipulation skills by passively watching abundant videos sourced online. These video-based learning paradigms show promising results, providing scalable supervision while reducing dataset bias. This survey reviews foundations such as video feature representation learning techniques, object affordance understanding, 3D hand/body modeling, and large-scale robot resources, as well as emerging techniques for acquiring robot manipulation skills from uncontrolled video demonstrations. We discuss how learning only from observing large-scale human videos can enhance generalization and sample efficiency for robotic manipulation. The survey summarizes video-based learning approaches, analyzes their benefits over standard datasets, surveys metrics and benchmarks, and discusses open challenges and future directions in this nascent domain at the intersection of computer vision, natural language processing, and robot learning.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, actionable list of what remains missing, uncertain, or unexplored in video-based learning for robot manipulation as surveyed in the paper.
- Lack of standardized, manipulation-centric benchmarks for learning purely from passive “in-the-wild” videos, including common task suites, evaluation protocols, and metrics for generalization across environments and embodiments.
- No controlled, apples-to-apples comparisons of representation learning methods (e.g., TCN, MAE, R3M, CVRL, DPC) under identical robot manipulation settings to isolate which pretraining objectives, modalities, and data sources most improve downstream policy performance.
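To make the comparison above concrete, the objective behind TCN-style pretraining can be sketched with a simple time-contrastive triplet loss: frames close in time should embed near each other, distant frames far apart. This is a minimal illustrative sketch (the function name, window sizes, and margin are assumptions, not the original TCN implementation):

```python
import numpy as np

def tcn_triplet_loss(emb, pos_window=2, margin=0.2, rng=None):
    """Time-contrastive triplet loss over a sequence of frame embeddings.

    Frames close in time (within `pos_window`) are pulled together, while
    temporally distant frames are pushed at least `margin` further away.
    `emb` has shape (T, D): one embedding vector per video frame.
    Illustrative sketch only; real TCN uses multi-view anchors and a CNN.
    """
    rng = rng or np.random.default_rng(0)
    T = len(emb)
    losses = []
    for a in range(T):
        # Positive: a frame within the temporal window of the anchor.
        p = min(T - 1, a + rng.integers(1, pos_window + 1))
        # Negative: a frame far outside the window.
        far = [t for t in range(T) if abs(t - a) > 2 * pos_window]
        if not far:
            continue
        n = rng.choice(far)
        d_pos = np.linalg.norm(emb[a] - emb[p])
        d_neg = np.linalg.norm(emb[a] - emb[n])
        losses.append(max(0.0, d_pos - d_neg + margin))
    return float(np.mean(losses)) if losses else 0.0
```

On embeddings that already vary smoothly over time the loss is zero, which is exactly the invariance such pretraining tries to induce before any downstream policy comparison.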
- Limited methods for robustly aligning unstructured human videos to robot action spaces: extracting reliable 6-DoF object/hand poses, contacts, and forces from arbitrary viewpoints without dense annotations or motion capture.
- Affordance grounding remains fragile under occlusion, clutter, and domain shift; uncertainty-aware affordance maps and actionable confidence measures are not standardized or evaluated by task success.
- Sparse evidence on handling deformable, articulated, transparent, or reflective objects using video-only pretraining; need targeted datasets and evaluation for these hard categories.
- Embodiment gap remains under-quantified: general, systematic mappings from human hands to diverse robot end-effectors (parallel grippers, suction, soft hands, bimanual arms) and their effect on task success are not well studied.
- Unsupervised temporal segmentation and subgoal discovery from passive videos are immature; methods to extract executable, composable skills with minimal labels and strong real-robot validation are missing.
- Scalable reward inference from videos (inverse RL/preference learning) is unresolved: how to learn reliable rewards from noisy, uncurated internet data while respecting safety constraints.
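One simple baseline behind the reward-inference problem above is perceptual progress shaping: score an observation by how far along a reference demonstration its nearest frame lies. The sketch below is a hedged illustration (the function name and nearest-neighbor scheme are assumptions; learned discriminators or preference models would replace it in practice):

```python
import numpy as np

def video_progress_reward(obs_emb, demo_embs):
    """Reward an observation by its temporal progress through a reference
    demonstration: find the nearest demo frame (in embedding space) and
    return that frame's normalized index in [0, 1].

    obs_emb:   (D,) embedding of the current observation
    demo_embs: (T, D) embeddings of the demonstration frames
    Minimal sketch; assumes embeddings come from some pretrained encoder.
    """
    dists = np.linalg.norm(demo_embs - obs_emb, axis=1)
    t = int(np.argmin(dists))
    return t / (len(demo_embs) - 1)
```

Such rewards are dense but brittle under viewpoint and embodiment shift, which is precisely why robust reward inference from uncurated video remains open.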
- Lack of causal reasoning: current models often learn correlations from tutorials; methods to infer causally necessary actions and counterfactuals from video, with causal evaluation protocols, are absent.
- Multimodal fusion beyond vision-language is underexplored; integrating audio, narration, eye gaze, EMG/tactile proxies to improve action inference lacks standardized datasets and ablation studies.
- Data quality, bias, and licensing in internet corpora are unaddressed at scale; reproducible curation pipelines, debiasing strategies, and ethical guidelines for using human videos in robotics are needed.
- Real-time control constraints: VLM/VLA backbones introduce inference latency; systematic approaches for low-latency perception-action pipelines (e.g., distillation, scheduling, hardware co-design) are not benchmarked.
- Safety-aware learning from videos is largely missing; detecting unsafe actions, enforcing compliance and human-proximity constraints, and providing formal safety guarantees during policy execution remain open.
- Long-horizon task composition is limited; strategies to combine video-derived primitives with hierarchical planners, subgoal validation, and recovery/backtracking require rigorous studies.
- Generalization testing is narrow; robust field trials in diverse homes/workplaces and standardized out-of-distribution protocols (scenes, objects, tasks, viewpoints) are lacking.
- Cross-view learning (ego ↔ exo) is not mature; consistent 3D metric scale recovery, cross-view camera pose estimation, and view-invariant representation transfer need method and benchmark development.
- Accurate 3D scene reconstruction from monocular videos in manipulation contexts (metric scale, articulations, contact geometry) remains unreliable; targeted metrics and integration with policy learning are needed.
- Grasp synthesis from videos is brittle under occlusion and motion blur; uncertainty calibration and closed-loop corrections informed by video-derived contact priors are insufficiently explored.
- Bimanual and mobile whole-body manipulation from videos lacks breadth; coordination, constraints, and locomotion-manipulation coupling need dedicated datasets and methods.
- Action representation trade-offs are unclear; head-to-head benchmarks contrasting discrete tokenization vs diffusion/flow-matching vs spatial grids for precision, stability, and sample efficiency are missing.
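To ground the discrete-tokenization side of the trade-off above, RT-1-style policies discretize each continuous action dimension into uniform bins so a transformer can predict action tokens. A minimal sketch (function names and the 256-bin default are assumptions, though uniform binning itself is the scheme RT-1 describes):

```python
import numpy as np

def tokenize_action(action, low, high, n_bins=256):
    """Map each continuous action dimension to a discrete token by
    uniform binning over the per-dimension range [low, high]."""
    a = np.clip(action, low, high)
    frac = (a - low) / (high - low)                 # normalize to [0, 1]
    return np.minimum((frac * n_bins).astype(int), n_bins - 1)

def detokenize_action(tokens, low, high, n_bins=256):
    """Invert the binning by returning each bin's center value."""
    return low + (tokens + 0.5) / n_bins * (high - low)
```

The round-trip error is bounded by half a bin width, which makes the precision cost of tokenization explicit and comparable against diffusion or spatial-grid action heads.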
- Policy adaptation with minimal robot demos is under-characterized; principled data selection (which videos, how many demos), and task transfer curves across embodiments are not reported.
- Continual learning on robots is open; methods to avoid catastrophic forgetting when adapting video-pretrained VLAs to new environments without offline retraining are scarce.
- Sim-to-real gaps persist for video-pretrained models; integrating proprioception/tactile signals and modeling contact dynamics to bridge physics mismatch has limited evidence.
- Instruction grounding remains weak; mapping textual commands to actionable, multi-step procedures when available videos are loosely related or noisy needs robust alignment and failure-mode analyses.
- Narration noise and misalignment in instructional datasets (ASR errors, temporal drift) are under-addressed; scalable automatic filtering, alignment, and segment localization pipelines are needed.
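A basic building block for the alignment pipelines called for above is a drift-tolerant overlap filter that keeps only ASR caption segments plausibly matching a video clip. The sketch below is illustrative (the function name, dict schema, and drift tolerance are assumptions):

```python
def align_captions(captions, clip_start, clip_end, drift=2.0):
    """Keep ASR caption segments that temporally overlap a video clip,
    padded by `drift` seconds to tolerate ASR timestamp error.

    captions: list of dicts with "start", "end" (seconds) and "text".
    Minimal filtering sketch; real pipelines also rescore text-video
    similarity before accepting a (caption, clip) training pair.
    """
    return [c for c in captions
            if c["end"] + drift > clip_start and c["start"] - drift < clip_end]
```

Tightening `drift` trades recall for alignment precision; scalable pipelines would tune it per corpus rather than fix it globally.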
- Uncertainty estimation and OOD detection in both perception and action modules are not standard practice; criteria for abstention, human-in-the-loop intervention, and recovery are missing.
- Compute and energy costs are opaque; standardized reporting of training/inference budgets, memory footprints, and carbon impacts for large VLAs is necessary for reproducibility.
- Affordance evaluation lacks task relevance; beyond segmentation IoU, metrics that quantify impact on contact safety, manipulation success, and generalization are needed.
- Ethical and legal considerations are not operationalized; concrete guidelines for consent, privacy, copyright, and human subjects protections in robot-use of internet videos are missing.
- Reproducible, end-to-end open-source pipelines to go from raw videos to robot policies (pose extraction, segmentation, affordance grounding, action synthesis, evaluation) are incomplete or fragmented.
- Cross-robot transfer protocols are limited; systematic studies on co-training vs specialization, and how skills transfer across platforms with different kinematics and sensing are absent.
- Learning force/impedance control from visual cues is unexplored; mapping video-derived contact states to compliant actions needs datasets with synchronized force/torque ground truth.
- Curriculum and active learning for video selection are underdeveloped; methods to automatically choose informative videos/segments and discover task taxonomies are needed.
- Failure recovery from videos is rarely studied; learning retry strategies and human-like error correction behaviors from demonstrations lacks methods and benchmarks.
- Standardized measures of hand-object contact quality (e.g., contact area, pressure proxy, slip risk) and automatic metrics to evaluate video-learned manipulation are not defined.
Practical Applications
Immediate Applications
Below are concrete, deployable use cases that can leverage today’s video-based learning techniques, datasets, and VLA (vision–language–action) models to deliver near-term value.
- Sector: Manufacturing, Warehousing — Few-shot deployment of new manipulation skills
- What: Warm-start robot pick-and-place, sorting, or tool-use policies by pretraining perception on in-the-wild video representations and then fine-tuning with 20–100 on-site demos.
- Tools/workflows: R3M, DINOv2, MAE-based encoders; OpenVLA with LoRA fine-tuning; datasets like Open X-Embodiment, DROID.
- Assumptions/dependencies: Calibrated cameras; modest GPU for fine-tuning; task demonstrations that reflect on-site variation; safety interlocks for cobots; licensing for any harvested web videos.
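The warm-start recipe above reduces, at its simplest, to fitting a small behavior-cloning head on frozen pretrained features from the on-site demos. A hedged sketch using ridge regression (function names and the closed-form head are assumptions; deployed systems would use an MLP or LoRA fine-tuning):

```python
import numpy as np

def fit_bc_head(features, actions, l2=1e-3):
    """Ridge-regression behavior-cloning head on frozen video features.

    features: (N, D) embeddings from a frozen pretrained encoder
    actions:  (N, A) demonstrated robot actions
    Returns a weight matrix W mapping features -> actions.
    """
    D = features.shape[1]
    gram = features.T @ features + l2 * np.eye(D)   # regularized normal equations
    return np.linalg.solve(gram, features.T @ actions)

def bc_policy(W, feature):
    """Predict an action from a single frozen-encoder feature vector."""
    return feature @ W
```

With 20-100 demos this closed-form head trains in milliseconds, which is why frozen-encoder warm starts are attractive for few-shot on-site deployment.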
- Sector: E-commerce returns, Retail automation — Affordance-guided grasping and placement
- What: Use affordance models to localize actionable regions (handles, lids, deformable seams) for robust grasping in clutter and novel objects.
- Tools/workflows: HAG-Net-style hand cues, AffordanceNet, VRB (Vision–Robotics Bridge) for bridging web affordances to robots; integrate into existing bin-picking stacks.
- Assumptions/dependencies: Depth/RGB-D sensing; handling of domain shift from web to factory lighting; gripper capability (compliance; tactile optional).
- Sector: Food service, Hospitality — Instruction-conditioned manipulation from cooking videos
- What: Extract step-wise actions from instructional videos (e.g., open, pour, stir) and align with constrained motion primitives for semi-autonomous prep tasks.
- Tools/workflows: Video–text pretraining with HowTo100M, WebVid-10M; policy fine-tuning with RT-1/OpenVLA; language prompts describing station configuration.
- Assumptions/dependencies: Strong hygiene/safety procedures; tool calibration (utensils, appliances); scripted guardrails around heat/sharp objects.
- Sector: Human–Robot Collaboration (HRC) — Action and intention monitoring for safe handovers
- What: Recognize human sub-activities and intent (reach, handover, retract) to trigger robot responses or slow/stop modes.
- Tools/workflows: Action recognition from UCF101, Ego4D; transformer-based HOI models; rule-based safety logic.
- Assumptions/dependencies: Line-of-sight cameras; latency budgets under 100 ms for stops; worker consent and privacy safeguards.
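The rule-based safety logic mentioned above can be as simple as a mode selector over recognized intent and separation distance. A minimal sketch (the intent labels, function name, and thresholds are illustrative assumptions, not a certified safety function):

```python
def safety_mode(intent, distance_m, stop_dist=0.5, slow_dist=1.0):
    """Choose a robot operating mode from the recognized human intent
    and the human-robot separation distance (meters).

    Illustrative only: real deployments need certified safety-rated
    stops (ISO 10218 / TS 15066), not vision-based logic alone.
    """
    if distance_m < stop_dist or intent == "reach":
        return "stop"    # human reaching in or too close: halt motion
    if distance_m < slow_dist or intent == "handover":
        return "slow"    # prepare compliant, reduced-speed motion
    return "normal"
```

Keeping this layer separate from the learned perception stack makes the latency budget auditable: the classifier can be slow to improve, but the stop rule itself stays trivial.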
- Sector: Robotics R&D, System Integration — Rapid demo capture via 3D hand modeling
- What: Collect high-quality teleop or “show and do” demos using monocular hand/body pose and map to robot kinematics for training.
- Tools/workflows:
FrankMocap,MANO,SMPL-Xfor pose;DexMV/DexVIPpipelines for mapping to robot hands; low-costLEAP Handteleop. - Assumptions/dependencies: Hand–robot embodiment mapping; camera placement; synchronization between hand/object tracking.
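The hand-robot embodiment mapping noted above has a trivially simple special case for parallel-jaw grippers: map the human thumb-index pinch distance to the gripper opening. A hedged sketch (function name and the 0.08 m range, roughly a common parallel gripper stroke, are assumptions):

```python
import numpy as np

def gripper_width_from_hand(thumb_tip, index_tip, max_width=0.08):
    """Map the human thumb-index pinch distance (3D positions in meters,
    e.g. from a MANO/FrankMocap fit) to a parallel-jaw gripper opening,
    clipped to the gripper's mechanical range. Minimal embodiment mapping."""
    pinch = float(np.linalg.norm(np.asarray(thumb_tip) - np.asarray(index_tip)))
    return min(pinch, max_width)
```

Dexterous hands need a far richer retargeting (per-finger joint optimization, as in DexMV/DexVIP), but this scalar mapping already covers most pick-and-place teleop capture.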
- Sector: Healthcare (assistive), Elder care — Personalization via in-home video demos
- What: Fine-tune assistive feeding, opening containers, or fetching tasks using a caregiver’s short phone-recorded demos.
- Tools/workflows: Behavior cloning with frozen video encoders (R3M, masked modeling); instruction-conditioned policies (RT-1-style).
- Assumptions/dependencies: Clinical oversight; fail-safe behaviors; compliance with privacy/medical data rules; household-specific environment calibration.
- Sector: Education, Makers — Low-cost lab curricula for robot learning by watching
- What: Student labs that pretrain perception on web videos and fine-tune skills on tabletop tasks with budget arms (xArm, WidowX).
- Tools/workflows: OpenVLA, Octo, LoRA fine-tuning; datasets Ego4D, Something-Something; MLOps notebooks for reproducibility.
- Assumptions/dependencies: Entry-level robots and RGB-D cameras; compute access (one desktop GPU); permissions for web data use.
- Sector: Field service, Utilities — Video-driven procedural guidance and checklists
- What: Leverage action recognition to track progress on inspection/maintenance tasks and guide semi-autonomous steps (e.g., valve operations).
- Tools/workflows: Video–text alignment from InternVid; action segmentation; on-device VLM prompts describing SOPs.
- Assumptions/dependencies: Controlled tool sets; safety lockouts; connectivity or on-edge inference; reliable localization.
- Sector: Software, Tooling — RobotOps pipeline for video→policy training
- What: End-to-end data engineering for scraping, filtering, segmenting, and aligning web videos with ASR captions to pretrain perception before on-site fine-tuning.
- Tools/workflows: HowTo100M, WebVid-10M, InternVid; ASR cleaning; data cards/dataset sheets; CI/CD for policy rollouts.
- Assumptions/dependencies: Legal review of data provenance; scalable storage; monitoring for dataset bias and drift.
- Sector: Policy and Governance — Procurement and documentation templates
- What: Require “dataset cards” and “model cards” for any robot policies trained on internet videos, plus hazard analysis for OOD behaviors.
- Tools/workflows: Standardized documentation checklists; third-party evaluation using RoboVQA-style visual reasoning tests.
- Assumptions/dependencies: Organizational buy-in; alignment with OSHA/ISO 10218/TS 15066; periodic audits.
- Sector: Smart home, Daily life — One-minute “show-and-tell” skill teaching
- What: Capture a quick egocentric or phone video to teach a home robot routines (clear table, load dishwasher) with constraint-based safety wrappers.
- Tools/workflows: Small BC heads on top of frozen video features; instruction prompts; trajectory cloning with guardrails.
- Assumptions/dependencies: Robust object detection in clutter; fallback teleop; battery and compute constraints on consumer hardware.
Long-Term Applications
These opportunities require further research, scaling, safety validation, or new infrastructure (hardware, datasets, or policy) before broad deployment.
- Sector: Home robotics, Personal assistance — Generalist household robots that learn from any online video
- What: Robots that watch arbitrary YouTube tutorials and execute multi-step tasks with limited or zero local demos.
- Tools/workflows: Video-first world models (GR-2), diffusion/flow-matched action policies (CogACT, π_0, GROOT N1), spatial action tokenization (SpatialVLA), large egocentric datasets (Ego-Exo-4D).
- Assumptions/dependencies: Strong sim-to-real transfer; 3D scene understanding; long-horizon planning; copyright/consent solutions for training data.
- Sector: Advanced manufacturing — Zero-downtime line changeover by watching task videos
- What: Autonomously reconfigure cell behaviors for new SKUs by extracting affordances and constraints from operator demonstration videos.
- Tools/workflows: HOI/affordance pipelines; VLA reasoning with subgoal images (visual CoT); multimodal verification.
- Assumptions/dependencies: Certified safety envelopes; verifiable task plans; high-fidelity digital twins; union and safety compliance.
- Sector: Dexterous manipulation, Logistics — In-hand reorientation and non-prehensile skills learned from human videos
- What: Mastery of cloth folding, cable routing, or cap-twisting via hand-pose priors and contact reasoning.
- Tools/workflows: DexMV/DexVIP mapping to anthropomorphic hands; self-supervised contact objectives; diffusion-based high-frequency control (π_0).
- Assumptions/dependencies: Reliable tactile sensing; durable dexterous hardware; robust contact-rich simulation; wear-and-tear management.
- Sector: Healthcare, Rehab — Personalized therapy and ADL support learned from patient-specific video routines
- What: Robots that adapt to mobility constraints and home layouts by learning from caregiver/patient videos plus natural language goals.
- Tools/workflows: Instruction-conditioned VLA fine-tuning; safety-critical RL with human oversight; privacy-preserving on-device training.
- Assumptions/dependencies: Regulatory approval (FDA/CE); data protection (HIPAA/GDPR); formal safety cases; clinician-in-the-loop workflows.
- Sector: Agriculture, Construction — Video-conditioned task libraries for seasonal or site-specific operations
- What: Build and share skill packs (e.g., pruning, fastening) from curated video corpora and fine-tune on new sites.
- Tools/workflows: Continual learning over Open X-Embodiment-style multi-embodiment corpora; retrieval-augmented video grounding for rare tasks.
- Assumptions/dependencies: Robust outdoor perception; domain randomization; liability frameworks for autonomous work sites.
- Sector: Autonomy R&D — Self-improving “watch–practice–deploy” loops
- What: Continual loops that retrieve relevant internet videos, distill subgoals/affordances, practice in sim, and safely deploy updates on-robot.
- Tools/workflows: World-model pretraining (GR-2), sim-to-real pipelines, automatic reward shaping from video (DVD-style discriminators), off-policy evaluation.
- Assumptions/dependencies: Reliable sim fidelity; safeguards against catastrophic forgetting; governance of self-updating systems.
- Sector: Standards, Regulation — Safety certification and auditability for video-trained robot policies
- What: Conformance tests, red-team suites, and provenance tooling for models trained on large, weakly curated video corpora.
- Tools/workflows: Standardized evaluation tasks (e.g., long-horizon manipulation benchmarks), dataset/model lineage tracking, scenario-based hazard testing.
- Assumptions/dependencies: Cross-industry consortia; legal clarity on training data; third-party certification labs.
- Sector: Education — National repositories of egocentric demonstrations for K–12 and vocational training
- What: Shared video curricula for teaching robots classroom assistance and lab safety tasks; students co-create and refine datasets.
- Tools/workflows: Curated Ego4D/Ego-Exo-4D derivatives; privacy-preserving publishing; classroom-safe robot kits with VLA backbones.
- Assumptions/dependencies: Consent management; equitable compute access; educator training.
- Sector: Energy, Infrastructure — Learning inspection and manipulation from archival video
- What: Robots learn standard inspection/manipulation routines (valves, breakers) from historical helmet-cam footage and expert walkthroughs.
- Tools/workflows: Temporal segmentation and action parsing; video-grounded task graphs; language-guided verification.
- Assumptions/dependencies: Legacy video quality variations; precise localization; fail-safe manipulation in hazardous zones.
- Sector: Platforms and Marketplaces — Cross-embodiment skill exchanges
- What: Publish/share skills once and adapt across arms and hands using unified 3D action spaces and embodiment-robust perception.
- Tools/workflows: Unified action grids (SpatialVLA), cross-robot datasets (Open X-Embodiment), policy distillation services.
- Assumptions/dependencies: Vendor cooperation on APIs; IP/licensing models; performance guarantees across hardware variants.