RoboBenchMart: Retail Automation Benchmark
- RoboBenchMart is an open-source simulated benchmark suite for evaluating robotic and mobile manipulation policies in dark-store retail environments.
- It employs structured procedural scene generation to create challenging store layouts featuring dense clutter, multi-level shelves, and diverse grocery categories.
- The framework supports large-scale trajectory generation, reinforcement learning baselines, and standardized evaluation protocols to assess VLA models.
RoboBenchMart is an open-source, simulated benchmark suite for evaluating robotic manipulation and mobile-manipulation policies in dark-store retail environments. It addresses the deficiencies of prior tabletop benchmarks by featuring dense clutter, multi-level shelving, a diverse grocery taxonomy, and realistic environmental complexity directly relevant to near-term retail automation. With its focus on structured procedural scene generation, trajectory collection, and comprehensive evaluation protocols, RoboBenchMart enables critical assessment of generalist Vision-Language-Action (VLA) models and supports reproducible research in mobile manipulation under real-world constraints (Soshin et al., 13 Nov 2025).
1. Objectives and Significance
RoboBenchMart is designed to bridge the gap between oversimplified, stationary-arm benchmarks and the reality of dark-store retail automation. Its main objectives include:
- Providing scalable tools for procedural generation of store layouts and product arrangements, enabling systematic evaluation under diverse shelf geometries and clutter scenarios.
- Facilitating the collection of large, automatically generated demonstration trajectories and standardized evaluation pipelines for manipulation tasks in retail settings.
- Supplying fine-tuned baseline models to establish performance references and expose current limitations of VLA-based policies.
Dark stores, characterized by low human-robot interaction and repeatable layouts, represent a strategic application domain. Robust automation in these contexts promises substantial impacts in labor reduction and order throughput, contingent on the ability to solve manipulation under spatial clutter and heterogeneous object geometries.
2. Procedural Generation of Store Layouts and Product Arrangement
RoboBenchMart implements a three-phase procedural scene generator, producing randomized and navigable retail layouts.
2.1 Fixture and Tensor Field Placement
The process accepts a store footprint and initial fixtures (pallets, freezers, boxes). Fixtures are placed via rejection sampling over the footprint polygon, ensuring collision-free initialization. Subsequent shelf orientation is guided by a 2D tensor field computed from local bases at fixture edges, each factoring in edge length and angular orientation. This construction draws on procedural street-modeling methods.
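The rejection-sampling step can be illustrated with a simplified sketch that assumes a rectangular footprint and axis-aligned fixtures (the benchmark operates on a general footprint polygon and uses the tensor field to orient fixtures):

```python
import random

def place_fixtures(footprint_w, footprint_h, fixture_sizes, max_tries=1000):
    """Rejection-sample axis-aligned fixture rectangles inside a rectangular
    footprint, rejecting any candidate that overlaps an already placed one."""
    placed = []  # each entry: (x, y, w, h) with (x, y) the lower-left corner

    def overlaps(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

    for (w, h) in fixture_sizes:
        for _ in range(max_tries):
            cand = (random.uniform(0, footprint_w - w),
                    random.uniform(0, footprint_h - h), w, h)
            if not any(overlaps(cand, p) for p in placed):
                placed.append(cand)
                break
        else:
            raise RuntimeError(f"footprint too crowded for a {w}x{h} fixture")
    return placed
```

Rejection sampling keeps the generator simple and trivially parallel; density is bounded only by `max_tries`, after which the layout is declared infeasible.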
2.2 Shelving Arrangement and Product Placement
Shelves are added in horizontal and vertical passes by interpolating the local tensor field major axis at each grid cell, subject to collision and passage constraints, with stochastic placement for diversity.
Product arrangement leverages 370 assets across 21 semantic grocery categories. Items are distributed on 4–5 shelf levels (boards) within a 2D grid per board. Clutter density is modulated via grid filling, Poisson-process-driven front gaps that mimic depletion, and probabilistic vertical stacking. Asset sizes are standardized via catalog and empirical measures. Placement incorporates small pose perturbations and shelf detection via scene_synthesizer.
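A simplified sketch of board filling and clutter measurement, assuming a per-board occupancy grid; `p_gap` and `p_stack` are illustrative parameters, and the Poisson-process front gaps are approximated here by independent per-slot draws:

```python
import random

def fill_board(rows, cols, p_gap=0.2, p_stack=0.1):
    """Fill a 2D grid of slots on one shelf board. Front-row gaps (mimicking
    customer depletion) are sampled per front slot; occupied cells may
    probabilistically hold a vertical stack of two items."""
    board = [[0] * cols for _ in range(rows)]  # 0 = empty, k = stack of k items
    for r in range(rows):
        for c in range(cols):
            if r == 0 and random.random() < p_gap:  # depleted front slot
                continue
            board[r][c] = 2 if random.random() < p_stack else 1
    return board

def clutter_density(board):
    """One plausible clutter measure: fraction of occupied grid cells."""
    cells = [cell for row in board for cell in row]
    return sum(1 for cell in cells if cell > 0) / len(cells)
```

The exact clutter-density formula used by the benchmark is not reproduced here; the occupied-cell fraction above is one straightforward reading of a grid-based definition.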
3. Trajectory Generation and Demonstration Collection
Approximately 3,000 demonstration trajectories are provided for a Fetch mobile manipulator, generated by:
3.1 Motion Planning
Anchor pose sequences (start, pre-grasp, grasp, post-grasp, etc.) are sampled per episode. Arm transitions first attempt straight-line screw motions; on collision, the planner falls back to RRT-Connect. Base motions employ heuristic planning. The combined planner reliably produces successful demonstration trajectories.
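The try-direct-then-fall-back strategy can be sketched as follows; for brevity this interpolates linearly in joint space rather than along screw motions, and `rrt_connect` and `collision_free` are user-supplied callables, not the benchmark's API:

```python
import numpy as np

def straight_line_segment(q_start, q_goal, collision_free, n_steps=50):
    """Try to connect two configurations with linearly interpolated waypoints;
    return the path, or None if any waypoint collides."""
    q_start, q_goal = np.asarray(q_start, float), np.asarray(q_goal, float)
    path = [q_start + t * (q_goal - q_start) for t in np.linspace(0.0, 1.0, n_steps)]
    return path if all(collision_free(q) for q in path) else None

def plan_between_anchors(anchors, collision_free, rrt_connect):
    """Chain anchor poses (start, pre-grasp, grasp, ...) into one trajectory,
    falling back to a sampling-based planner when the direct motion collides."""
    trajectory = []
    for q_a, q_b in zip(anchors, anchors[1:]):
        segment = straight_line_segment(q_a, q_b, collision_free)
        if segment is None:
            segment = rrt_connect(q_a, q_b)  # fallback planner (user-supplied)
        trajectory.extend(segment)
    return trajectory
```

Attempting the cheap direct motion first keeps most transitions fast, while the sampling-based fallback preserves completeness in cluttered shelf regions.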
3.2 Reinforcement Learning
Separate PPO policies, each with access to privileged state, are trained per task using a shaped, task-specific reward. PPO training complements the planning data, yielding a comparable feasibility rate.
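The paper's exact reward terms are not reproduced above; as an illustration, a typical shaped pick-and-place reward combines dense reach and transport distances with sparse grasp and success bonuses (all weights below are hypothetical):

```python
import numpy as np

def shaped_reward(ee_pos, obj_pos, goal_pos, grasped, success):
    """Hypothetical shaped reward for a pick-and-place PPO policy:
    dense distance terms plus sparse grasp and success bonuses."""
    reach = -np.linalg.norm(np.asarray(ee_pos) - np.asarray(obj_pos))
    transport = (-np.linalg.norm(np.asarray(obj_pos) - np.asarray(goal_pos))
                 if grasped else 0.0)
    return reach + transport + 0.5 * float(grasped) + 5.0 * float(success)
```

Dense terms make early exploration tractable; the sparse terminal bonus keeps the optimum aligned with actual task success.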
4. Supported Tasks and Evaluation Protocols
4.1 Task Taxonomy
Atomic Tasks:
- Pick to Basket: pick a named object and deposit into a cart.
- Pick from Floor: retrieve a fallen item and return to shelf.
- From Board to Board: promote an object by one shelf level.
- Open/Close Fridge: actuate a fridge or showcase door.
Composite Tasks:
- Pick N Items: chain of Pick to Basket subgoals.
- Pick from Fridge: Open → Pick to Basket → Close.
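The composite tasks above can be expressed as ordered chains of atomic subgoals; a minimal sketch (the `Subgoal` type and task-name strings are illustrative, not the benchmark's API):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Subgoal:
    task: str              # atomic task name, e.g. "open_fridge"
    target: Optional[str]  # named object, or None for fixture-level tasks

def pick_from_fridge(item: str) -> List[Subgoal]:
    """Composite task: Open -> Pick to Basket -> Close."""
    return [Subgoal("open_fridge", None),
            Subgoal("pick_to_basket", item),
            Subgoal("close_fridge", None)]

def pick_n_items(items: List[str]) -> List[Subgoal]:
    """Composite task: chain of Pick to Basket subgoals."""
    return [Subgoal("pick_to_basket", it) for it in items]
```

Scoring composites as subgoal chains also explains the 0% composite results reported below: a single failed atomic step aborts the whole chain.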
4.2 Evaluation Axes and Metrics
Axes of variation include RP (start randomization), T (texture), SL (store layout), USA (unseen shelves), UTI (unseen-in-task items), and CUI (completely unseen items).
Core testing scenarios:
- In-Domain: RP only.
- Unseen Scenes: RP + T + SL.
- Unseen Scenes + Items: adds UTI.
Metrics:
- Success Rate: fraction of evaluation episodes in which the task is completed.
- Average Completion Time: mean completion time over successful episodes.
- Clutter Density: as defined above.
The reporting protocol fixes the set of evaluation episodes per (task, item, fixture) triplet.
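Given per-episode records, the two core metrics can be computed as in this minimal sketch (the `(success, duration)` record format is an assumption, not the benchmark's logging schema):

```python
from typing import List, Tuple

def summarize(episodes: List[Tuple[bool, float]]) -> Tuple[float, float]:
    """Success rate over all episodes, and average completion time
    computed over successful episodes only."""
    durations = [d for ok, d in episodes if ok]
    rate = len(durations) / len(episodes)
    avg_time = sum(durations) / len(durations) if durations else float("nan")
    return rate, avg_time
```

Averaging time over successes only is the conventional choice, since failed episodes terminate on a timeout that would otherwise dominate the mean.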
5. Baseline Models and Empirical Results
All baselines are fine-tuned by imitation learning over 2,976 trajectories.
Octo Transformer (93M params) receives RGB input from shoulder and gripper cameras, proprioceptive feedback, and textual commands; trained multimodally for 1M steps on 8×A100 with AdamW.
π₀ and π₀.₅ VLA flow models (3.3B params each) employ the official JAX codebase; each is trained with AdamW and cosine learning-rate decay, batch size 256, for 75K steps (~4 days).
5.1 Performance on Atomic Tasks
Success rates (%) on atomic tasks:

| Model | Scenario | Pick→Basket | Pick from Floor | Board→Board | Open Fridge | Close Fridge |
|---|---|---|---|---|---|---|
| Octo | In-Domain | 17 | 2 | 13 | 32 | 41 |
| Octo | Unseen Scenes | 1 | 0 | 2 | 10 | 37 |
| π₀ | In-Domain | 22 | 29 | 15 | 48 | 83 |
| π₀ | Unseen Scenes | 1 | 12 | 5 | 25 | 75 |
| π₀.₅ | In-Domain | 63 | 44 | 55 | 50 | 85 |
| π₀.₅ | Unseen Scenes | 38 | 11 | 22 | 37 | 77 |
Composite tasks (Pick 3, Pick from Fridge) yielded 0% success for all models. In the Unseen Scenes + Items scenario, π₀.₅ achieved 10–38% on some atomic tasks, while Octo and π₀ failed entirely.
6. Failure Analysis, Limitations, and Future Extensions
6.1 Common Failure Modes
- Gripper misalignment causing failed grasps or object drops.
- Incorrect object selection amid clutter.
- Collisions with or disturbance of adjacent products.
- Successful grasp but improper placement.
6.2 Benchmark Limitations
- Only a parallel-jaw gripper is supported; absence of suction or dexterous hands.
- Restriction to rigid-body assets; deformables such as bags are not included.
- Shelf gaps larger than typical, to ensure feasible grasps.
6.3 Anticipated Future Directions
Planned enhancements include extending support to suction and multi-finger end-effectors, inclusion of deformable and irregular products, dynamic rendering resolution, and additional task domains such as packing, anomaly detection, and multi-robot collaboration.
All software, assets, and tools are publicly released under permissive terms (MIT for code, CC BY-NC 4.0 for assets) (Soshin et al., 13 Nov 2025).