This paper investigates how to efficiently collect data for training robotic manipulation policies that can generalize to new, unseen scenarios. The core idea is to exploit "compositional generalization," where a policy trained on data covering individual environmental factors (e.g., different object types, various table heights) can successfully operate in situations with unseen combinations of these factors.
The authors hypothesize that if policies can compose learned skills across different environmental variations, data collection can be made significantly more efficient. Instead of collecting data for every possible combination of factor values (which scales exponentially: N^K combinations for K factors with N values each), one could focus on covering individual factor values (only N·K of them).
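To make the scaling gap concrete, here is a minimal arithmetic sketch; the factor counts below are illustrative, not taken from the paper:

```python
# Illustrative comparison of coverage requirements (hypothetical numbers).
K = 5   # environmental factors (e.g., object type, table height, ...)
N = 10  # values per factor

complete_coverage = N ** K  # every combination of factor values
factor_coverage = N * K     # every individual factor value, seen at least once

print(f"Complete coverage: {complete_coverage:,} combinations")  # 100,000
print(f"Factor-value coverage: {factor_coverage} values")        # 50
```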
Key Questions Addressed:
- When do robotic imitation learning policies exhibit compositional generalization?
- What are effective data collection strategies to exploit composition for broad generalization while reducing effort?
Data Collection Strategies Proposed and Compared:
The paper defines and evaluates several data collection strategies, illustrated in the paper on grids of factor-value combinations:
- No Variation: Data collected for only a single combination of factor values.
- Single Factor: Varies only a single factor across its values, keeping all other factors at their base values.
- Random: Periodically resamples an entirely random combination of all factor values.
- Diagonal: Samples new combinations in which every factor takes a previously unseen value. Covers all factor values with K(N-1) factor changes.
- L: Varies one factor at a time away from a base combination of factor values. Covers all factor values with K(N-1) factor changes.
- Stair: Cyclically varies one factor at a time, preserving the other factors' current values. Covers all factor values with K(N-1) factor changes, but captures more diverse combinations than Diagonal or L for the same number of factor changes.
- Complete: Covers all possible combinations of factor values (often infeasible).
The strategies "Stair," "L," and "Diagonal" are designed to exploit compositional generalization by covering individual factor values efficiently; a sketch of how each could generate its sequence of combinations follows.
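The following is a minimal sketch that generates each strategy's sequence of factor-value combinations and counts factor changes; the function names and orderings are assumptions for illustration, not the paper's code:

```python
# A "combination" is a tuple of value indices, one per factor: K factors, N values each.

def diagonal(K, N):
    """Each new combination gives every factor a value it has never taken before."""
    return [tuple(v for _ in range(K)) for v in range(N)]

def l_shaped(K, N):
    """Vary one factor at a time away from the all-zeros base combination,
    ordered so consecutive combinations differ in as few factors as possible."""
    combos = [tuple(v if k == 0 else 0 for k in range(K)) for v in range(N - 1, -1, -1)]
    for k in range(1, K):
        combos += [tuple(v if j == k else 0 for j in range(K)) for v in range(1, N)]
    return combos

def stair(K, N):
    """Cyclically advance one factor at a time, preserving the others' values."""
    combo, combos, k = [0] * K, [tuple([0] * K)], 0
    while any(v < N - 1 for v in combo):
        if combo[k] < N - 1:
            combo[k] += 1
            combos.append(tuple(combo))
        k = (k + 1) % K
    return combos

def factor_changes(combos):
    """Collection effort: total individual factor-value changes along the sequence."""
    return sum(sum(a != b for a, b in zip(p, c)) for p, c in zip(combos, combos[1:]))

for name, fn in [("Diagonal", diagonal), ("L", l_shaped), ("Stair", stair)]:
    seq = fn(K=2, N=4)
    print(f"{name}: {len(seq)} combos, {factor_changes(seq)} factor changes")
# All three cover every factor value with K(N-1) = 6 changes in this example,
# but Stair's combinations are more diverse (off the base axes) than Diagonal's or L's.
```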
Experiments:
The authors conduct extensive experiments both in simulation and on a real robot:
- Simulation Experiments:
  - Platform: Factor World, a simulation environment supporting variation in environmental factors.
  - Tasks: Pick Place and Door Open.
  - Factors Varied (up to 5): Object position, object texture, table texture, camera position, and distractor objects.
  - Evaluation: Policies are trained with behavior cloning on datasets collected under each strategy; performance is measured by success rate on unseen combinations of factor values (see the held-out-split sketch after this Experiments list).
  - Settings:
    - Pairwise composition: two factors varied at a time.
    - Multi-factor composition: more than two factors (up to all five) varied together.
- Real Robot Experiments:
  - Platform: WidowX 250 6DOF robot arm in a real office kitchen. Task: putting a fork into a container.
  - Factors Varied (primarily 5 physical/visual factors): Object type (different forks), container type, table height, table texture, and object position.
  - Data Collection: Human demonstrations (160 total for the "L" and "Stair" strategies, covering 16 combinations each).
  - Policy: Diffusion goal-conditioned behavior cloning, with and without pre-training/co-fine-tuning on BridgeData V2, a large prior robotic dataset (see the co-fine-tuning sketch after this Experiments list).
  - Evaluations:
    - Pairwise Composition: Assessed in the "BaseKitch" environment on 9 unseen combinations for each of 10 factor pairs.
    - Out-of-Domain (OOD) Transfer: Policies trained in "BaseKitch" are tested in two new kitchens ("CompKitch," "TileKitch") with inherent differences (e.g., table texture, lighting, distractors) and additional factor shifts.
    - Unaccounted Factors: Robustness to distractor objects (a held-out factor) in BaseKitch.
    - Camera Position Composition: Composition of camera position (main vs. secondary camera) with table texture.
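A hedged sketch of how such a held-out evaluation split could be constructed: the unseen combinations are simply those a collection strategy never visits (the Diagonal example and names are illustrative):

```python
from itertools import product

K, N = 2, 4
all_combos = set(product(range(N), repeat=K))
# Combinations visited during collection, here by the Diagonal strategy:
train_combos = {tuple(v for _ in range(K)) for v in range(N)}
# Evaluation: roll out the trained policy and measure success rate on these:
unseen_combos = all_combos - train_combos
print(f"{len(unseen_combos)} of {len(all_combos)} combinations held out")  # 12 of 16
```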
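As a rough illustration of the co-fine-tuning recipe, here is a hedged sketch of batch construction that mixes prior and in-domain demonstrations; the 50/50 ratio and names are assumptions, not values reported in the paper:

```python
import random

def cofinetune_batches(prior_data, in_domain_data, batch_size=64, prior_frac=0.5):
    """Yield training batches mixing a large prior dataset (e.g., BridgeData V2)
    with a small in-domain dataset, instead of fine-tuning on in-domain data alone."""
    n_prior = int(batch_size * prior_frac)
    while True:
        batch = random.sample(prior_data, n_prior)
        # The in-domain set is small, so sample it with replacement.
        batch += random.choices(in_domain_data, k=batch_size - n_prior)
        random.shuffle(batch)
        yield batch  # feed to the behavior-cloning update
```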
Key Findings:
- Compositional Generalization Exists:
  - In simulation, policies showed strong pairwise compositional abilities. Strategies like Stair, L, and Diagonal outperformed Random and approached Complete with fewer factor changes.
  - On the real robot, policies also exhibited composition, particularly when leveraging prior data (BridgeData V2). The "L" strategy with prior data succeeded in 59/90 unseen pairwise combinations, compared to 28/90 without prior data and 22/90 for "No Variation" with prior data.
- Prior Data is Crucial for Real Robots:
  - Leveraging prior datasets like BridgeData V2 significantly enhanced compositional abilities on the real robot; this was less critical in the cleaner simulation environment.
  - Co-fine-tuning on a mix of in-domain and prior data was generally more effective than fine-tuning a pre-trained model alone.
  - Prior data also helped maintain robustness to unaccounted factors (e.g., distractors), which sparse in-domain data collection strategies can otherwise erode.
- Effective Data Collection Strategies:
  - Stair generally performed best across simulation (especially in the multi-factor setting) and real-robot experiments (best OOD transfer: 31/40 successes with co-fine-tuning). It balances efficient coverage of individual factor values with exposure to a greater diversity of combinations than L or Diagonal.
  - L also showed strong performance, particularly in the pairwise composition analysis and OOD transfer (24/40 successes with co-fine-tuning), and can be the practically easier choice when it is more convenient to vary factors separately.
  - Strategies exploiting composition (Stair, L, Diagonal) significantly outperformed "No Variation" and often "Random" for the same data collection effort (measured in factor changes).
- Challenges in Composition:
  - Composition was generally weaker for pairs of physical factors that interact in complex ways (e.g., object position and table height, both of which affect the grasping motion). Visual factors, or factors with less physical interaction, composed more easily.
- Out-of-Domain Transfer:
  - The best policy (Stair + BridgeData V2 co-fine-tuning) achieved a 77.5% (31/40) success rate in entirely new kitchens with unseen combinations of factors.
  - Policies trained without prior data, or without variation in the in-domain data, failed to transfer effectively (0/40 and 1/40, respectively).
Practical Implications and Contributions:
The paper provides actionable insights for roboticists collecting in-domain data:
- Prioritize Factor Variation: Even if not all combinations can be covered, varying individual factors is crucial.
- Use Efficient Strategies: Strategies like "Stair" or "L" can achieve good generalization with significantly less data collection effort than trying to cover all combinations or relying on purely random variations.
- Leverage Prior Datasets: Incorporating large, diverse prior datasets (like BridgeData V2) through pre-training and co-fine-tuning is critical for robust compositional generalization, especially on real robots.
- Consider Factor Interactions: Be aware that composition might be harder for factors that have complex physical interactions. More data might be needed for such combinations.
- Co-fine-tuning is Preferred: When using prior data, co-fine-tuning with a mix of prior and new in-domain data seems more effective than just fine-tuning.
The research demonstrates that by understanding and exploiting the compositional generalization capabilities of imitation learning policies, data collection for robotic manipulation can be made more systematic and efficient, leading to policies that generalize better to novel environments and tasks.