DeliveryBench: Multi-Domain Delivery Benchmark
- DeliveryBench is a comprehensive benchmark ecosystem featuring realistic simulations, industrial-scale datasets, and rigorous evaluation protocols for multi-modal delivery challenges.
- It formalizes diverse delivery optimization problems—embodied delivery, dynamic pickup, and last-mile routing—with explicit constraints, mathematical frameworks, and evaluation metrics.
- The platform supports reproducible research through high-fidelity simulators, standardized benchmarks, and algorithmic baselines that bridge theoretical optimization with real-world logistics.
DeliveryBench denotes a rigorous, multi-dimensional benchmark ecosystem for evaluating algorithms and embodied agents in the context of real-world delivery logistics. It encompasses a spectrum ranging from classic static and dynamic vehicle routing to last-mile and embodied city-scale delivery, with strong emphasis on realistic constraints, data fidelity, and reproducibility. Major DeliveryBench resources span industrial-scale datasets, interactive high-fidelity simulators, formal task specifications, and standardized evaluation protocols.
1. Formal Problem Domains and Mathematical Structure
DeliveryBench implementations instantiate diverse delivery optimization problems, united by precise mathematical frameworks.
- City-Scale Embodied Delivery (Food Delivery Benchmark) The agent's state comprises:
- Physical pose and current transport mode (walking, e-scooter, car, or bus)
- Resource levels (stamina, battery, cash)
- Active order pool (statuses and deadlines) with associated food-attribute tensors (temperature, fragility, odor)
- Global map structure (road network, POIs)
The action space includes:
- Navigation actions (high-level MOVE_TO, low-level step/turn)
- Order manipulation (view, accept, pickup, deliver)
- Resource and social actions (rest, recharge, buy, rent, help)
The principal objective is profit maximization subject to domain constraints:
$$\max_{\pi \in \Pi} \; R(\pi) - C(\pi)$$
where $R(\pi)$ is aggregate earnings (including delivery and rating-based bonuses), $C(\pi)$ is total costs, and $\Pi$ is the set of constraint-feasible policies (Mao et al., 22 Dec 2025).
- Dynamic Pickup and Delivery (DPDP) Formulated over a directed network among factories, with orders arriving in real time. A fleet of capacity-limited vehicles is assigned to the orders. Key constraints include order fulfillment, vehicle capacity, strict time windows, driver shifts, LIFO loading, and dock waiting. The objective minimizes a combination of total lateness and average vehicle distance (Hao et al., 2022).
- Planar VRP and Multi-objective Benchmarks In the classic static VRPBench scenario, the domain is a planar graph with a depot, a set of customers, and k vehicles. Routes must minimize total route length, or an alternative objective mix such as the number of routes or load variance. VRPBench supports augmentation for time windows, capacities, and lexicographic or Pareto multi-objective optimization (Zeni et al., 2016).
- Last-Mile Delivery (LaDe) Instance data is structured over spatio-temporal events with courier trajectories, package locations, time windows, and city/region segmentation (modeled as a region graph). Formal definitions emphasize task sequences, time-window constraints, and region-level flow (Wu et al., 2023).
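The embodied-delivery state and profit objective described above can be sketched as a small data model. This is a minimal illustration only: the class names, field names, and the `hourly_net_profit` helper are hypothetical and are not taken from the DeliveryBench codebase.

```python
from dataclasses import dataclass, field


@dataclass
class Order:
    """One entry in the active order pool (field names are assumptions)."""
    order_id: str
    deadline_s: float       # delivery deadline, seconds since episode start
    status: str = "open"    # e.g. "open", "accepted", "picked_up", "delivered"


@dataclass
class AgentState:
    """Hypothetical container for the state components listed in the text."""
    pose: tuple[float, float, float]   # x, y, heading
    mode: str                          # "walking", "e-scooter", "car", or "bus"
    stamina: float
    battery: float
    cash: float
    orders: list[Order] = field(default_factory=list)


def hourly_net_profit(earnings: float, costs: float, elapsed_s: float) -> float:
    """Net profit (earnings minus costs), normalized to one hour of play."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return (earnings - costs) * 3600.0 / elapsed_s
```

For example, an agent that earns 40 and spends 10 over two simulated hours has an hourly net profit of 15 under this definition.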
2. Datasets, Simulators, and Synthetic Environments
DeliveryBench encompasses a range of dataset and simulator resources, each targeting distinct facets of the delivery problem:
| Domain | Benchmark/Dataset | Scale | Format/Simulator | Key Features |
|---|---|---|---|---|
| Embodied Delivery | DeliveryBench | 9 cities | Unreal Engine sim | Long-horizon city-scale, rich agent actions |
| DPDP | Huawei DPDP Comp. | 2k–5k jobs | JSON/event sim | Dynamic, dock/shift, time-windows, LIFO |
| Planar VRP | VRPBench | 1k–10k pts | 2D graph tools | Real geography, multi-objective, visualization |
| Last-mile | LaDe | 10.7M pkgs | Annotated CSV/JSON | Courier GPS, five cities, package events |
- DeliveryBench utilizes a high-fidelity, Unreal Engine–powered simulator for 3D city environments, integrating procedural map generation, resource-constrained mobility, food quality physics, and social/teamwork actions (Mao et al., 22 Dec 2025).
- DPDP Benchmark orchestrates a discrete-event simulator, enforcing rigorous time constraints (600s per optimization tick), multi-pallet handling, and empirical factory networks (Hao et al., 2022).
- VRPBench provides planar city-graph extraction tools, density-driven sampling, and GUI solution overlays, producing instances up to 10,000 nodes, far exceeding typical TSPLib/VRP test sizes (Zeni et al., 2016).
- LaDe is a real-world courier dataset with >10M packages, 21k couriers, and second-level spatiotemporal logs from multiple cities, standardizing last-mile task modeling (Wu et al., 2023).
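A tick-driven dispatch loop of the kind a discrete-event DPDP simulator implies can be sketched as follows. The sketch assumes the 600 s figure is the simulated interval between dispatcher calls; the function name `run_ticks` and the callback signature are hypothetical, not the competition simulator's API.

```python
from typing import Callable

TICK_S = 600  # assumed interval between dispatcher calls, in simulated seconds


def run_ticks(
    arrivals: list[tuple[float, str]],
    horizon_s: float,
    dispatch: Callable[[float, list[str]], None],
) -> int:
    """Call `dispatch` once per tick with the orders that arrived since the last tick.

    arrivals: (arrival_time_s, order_id) pairs in any order.
    Returns the number of ticks executed.
    """
    pending = sorted(arrivals)
    next_idx = 0
    ticks = 0
    t = TICK_S
    while t <= horizon_s:
        batch = []
        # Collect every order whose arrival time falls at or before this tick.
        while next_idx < len(pending) and pending[next_idx][0] <= t:
            batch.append(pending[next_idx][1])
            next_idx += 1
        dispatch(t, batch)
        ticks += 1
        t += TICK_S
    return ticks
```

Batching arrivals per tick keeps the dispatcher's view consistent with what an online algorithm could actually know at each decision point.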
3. Task Definitions, Constraints, and Evaluation Protocols
DeliveryBench distinguishes itself by formalizing a suite of demanding real-world tasks with explicit constraints and evaluation regimes:
- Embodied Delivery
- Task: Maximize net profit over episodic agent deployment, subject to delivery deadlines, food quality, stamina, battery, economic trade-offs, sequencing, and social constraints.
- Evaluation: Hourly net profit, on-time rate, planning and resource metrics, violation fractions, food/customer satisfaction, stratified by city size and agent type (Mao et al., 22 Dec 2025).
- Dynamic Pickup and Delivery
- Tasks: Online dispatch, vehicle scheduling with full respect for capacity, time-windows, shift and dock availability, LIFO constraints.
- Evaluation: Primary score combining total lateness and average vehicle distance, plus service rate, average/maximum lateness, and computational resource tracking (Hao et al., 2022).
- Static/Planar VRP
- Task: Solve for cost-optimal k-vehicle routes under given customer distributions, with extensions for capacity and time-window.
- Evaluation: Total/average route length, vehicle count, fairness, solution feasibility, with visualization and potential lower-bound gap reporting (Zeni et al., 2016).
- Last-Mile Datasets
- Tasks: Route prediction (sequence learning), ETA prediction, spatio-temporal demand forecasting (region graph).
- Evaluation: HR@k, Kendall’s τ, LMD, and edit distance for routing; MAE, RMSE, and accuracy within a tolerance threshold for ETA; MAE/RMSE for forecasting, stratified by city, scenario, and AOI (Wu et al., 2023).
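Several of the last-mile metrics named above have simple reference forms. The sketch below uses common textbook definitions (top-k overlap for HR@k, pairwise concordance for Kendall's τ without ties); these are assumptions and may differ in detail from the definitions used in the LaDe evaluation code.

```python
import math


def hit_rate_at_k(pred: list[int], true: list[int], k: int) -> float:
    """Share of the first k true stops that also appear in the first k predicted stops."""
    if k <= 0:
        raise ValueError("k must be positive")
    return len(set(pred[:k]) & set(true[:k])) / k


def kendall_tau(pred: list[int], true: list[int]) -> float:
    """Kendall rank correlation between two orderings of the same items (no ties)."""
    n = len(true)
    if n < 2:
        raise ValueError("need at least two items")
    rank = {item: r for r, item in enumerate(pred)}
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            # The true order places true[i] before true[j]; check the prediction agrees.
            if rank[true[i]] < rank[true[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def mae(pred: list[float], true: list[float]) -> float:
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred: list[float], true: list[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))
```

HR@k ignores order within the top k, while Kendall's τ scores the full ordering, so the two are usually reported together.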
4. Baselines, Algorithmic Approaches, and Leaderboards
DeliveryBench supports meaningful comparison by providing baseline implementations, reproducible protocols, and (often) public leaderboards:
- Embodied Agent Baselines
- VLMs: GPT-5, GPT-4o, Claude-3.7, Gemini, and open-source Qwen and LLaMA variants.
- Humans serve as empirical upper baselines: in DeliveryBench, humans earn \$50–63/hr versus \$31/hr for the best LLM in small/medium cities.
- Planning ablations and context engineering show substantial gains (e.g., using explicit plan-annotation or “dynamic cheatsheet” raises performance).
- Multi-agent settings: Small teams perform best; large-team performance drops sharply due to coordination brittleness (Mao et al., 22 Dec 2025).
- DPDP Competition
- Winner: Variable Neighborhood Search; the runners-up rely on heuristics augmented for dynamic constraints.
- RL/DRL approaches trail well-tuned combinatorial and threshold-based heuristics due to real-time and scalability limits.
- Runtimes and real-time compliance are strictly enforced within the evaluation loop (Hao et al., 2022).
- Planar VRP
- Instances far exceed traditional benchmarks in size (up to 10,000 nodes versus roughly 1,000 at most for TSPLib/VRPTW sets).
- Algorithms can optimize for distance, vehicle count, or fairness. Lower bounds are used when available for measuring solution quality (Zeni et al., 2016).
- LaDe Regional Models
- Classical (TimeGreedy, DistanceGreedy, Or-Tools), ML (OSquare, Graph2Route, FDNET), and GNN-based models (ASTGCN, DCRNN).
- Example: Graph2Route achieves HR@3 ≈ 71.7 %, outperforming heuristic baselines in last-mile route prediction (Wu et al., 2023).
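The two greedy heuristics named among the classical baselines can be illustrated in a few lines. This is a sketch under assumed definitions (nearest unvisited stop by Euclidean distance, and earliest deadline first); the function names and inputs are hypothetical, and the LaDe implementations may differ.

```python
import math


def distance_greedy(
    start: tuple[float, float], stops: dict[str, tuple[float, float]]
) -> list[str]:
    """Repeatedly visit the nearest unvisited stop (Euclidean distance)."""
    remaining = dict(stops)   # copy so the caller's dict is not mutated
    current = start
    route = []
    while remaining:
        nearest = min(remaining, key=lambda s: math.dist(current, remaining[s]))
        route.append(nearest)
        current = remaining.pop(nearest)
    return route


def time_greedy(deadlines: dict[str, float]) -> list[str]:
    """Visit stops in order of increasing deadline."""
    return sorted(deadlines, key=deadlines.get)
```

Both run without any training data, which is what makes them useful reference points for learned route predictors.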
5. Domain-Specific Constraints and Realism
DeliveryBench environments and datasets are engineered for high fidelity to operational delivery constraints:
- Resource Coupling: All major environments enforce stamina/battery/cash/vehicle/routing interactions; depletions yield forced halts or cost penalties.
- Food Quality Modeling: Discrete modeling of thermal drift, fragility (damage by vibration), and odor transfer introduces cross-order dependencies not found in classic VRP/OR benchmarks.
- Social/Competition Dynamics: Multi-agent competition for order pools, charging stalls, and explicit cooperative help actions are implemented in embodied simulators.
- Dock and Shift Realism: In DPDP, vehicles must respect factory dock capacities, FCFS service, queueing, and work shift intervals. LIFO constraints on pallet loading restrict feasible delivery sequences.
- Density and Heterogeneity: Planar VRPBench leverages empirical street-class, type, and land-use-based density factors for realistic customer distributions (Zeni et al., 2016). LaDe spans five heterogeneous cities with distinct demand/supply patterns (Wu et al., 2023).
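The LIFO loading rule mentioned for DPDP restricts which delivery sequences are feasible, and it can be checked with a stack. The following is a minimal sketch; the event encoding and the function name `is_lifo_feasible` are assumptions made for illustration, not the benchmark's interface.

```python
def is_lifo_feasible(events: list[tuple[str, str]]) -> bool:
    """Check a route's load/unload sequence against last-in-first-out loading.

    events: ("pickup", order_id) or ("delivery", order_id) pairs in route order.
    Cargo still on board at the end of the sequence is allowed.
    """
    stack: list[str] = []
    for kind, order_id in events:
        if kind == "pickup":
            stack.append(order_id)
        elif kind == "delivery":
            # Only the most recently loaded order can be unloaded next.
            if not stack or stack[-1] != order_id:
                return False
            stack.pop()
        else:
            raise ValueError(f"unknown event kind: {kind}")
    return True
```

Picking up A then B forces B to be delivered before A; a route that delivers A first is rejected.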
6. Extensions, Adaptability, and Comparative Strengths
DeliveryBench supports robust extensibility:
- Scenario Expansion: Embodied simulation, classic VRP, last-mile, and DPDP frameworks enable research on static/dynamic routing, multi-objective optimization, teamwork, and stochastic/demand-forecasting tasks.
- Adapting to New Contexts: VRPBench provides a documented pipeline for generating instances from arbitrary city street networks, supporting adaptation via GIS/OSM imports and land-use-based density modeling (Zeni et al., 2016).
- Open Toolchains: DeliveryBench releases code, data, and configuration schemas for integration and public evaluation.
- Comparative Strengths: Relative to TSPLib, Solomon, and CVRPLib, DeliveryBench instances offer order-of-magnitude scaling, empirical geographical heterogeneity, multi-objective support, and context-rich simulation (Zeni et al., 2016).
7. Impact, Current Limitations, and Future Directions
DeliveryBench has created a unified platform for evaluating logistics and embodied reasoning in realistic, constraint-dense settings.
- Revealed Gaps: Current agents (notably VLM-based) underperform humans, especially in parallel task management and resource foresight; they are prone to commonsense violations (e.g., letting food items spoil) and have brittle multi-agent coordination (Mao et al., 22 Dec 2025).
- Algorithmic Insights: Heuristic and hybrid methods remain competitive under dynamic constraints and real-time budgets; pure RL/DRL is not yet dominating at operational scales (Hao et al., 2022).
- Forward Trajectory: Future efforts will target real-time embodied reasoning, large-scale RL/imitation training, richer social/negotiation mechanisms, and removal of privileged spatial modal priors to test generalization (Mao et al., 22 Dec 2025).
DeliveryBench serves as a comprehensive, reproducible, and practical foundation for advancing research in multi-faceted delivery domains, closing the gap between theoretical optimization and deployment in complex, real-world environments (Mao et al., 22 Dec 2025, Wu et al., 2023, Zeni et al., 2016, Hao et al., 2022).