DeliveryBench: Multi-Domain Delivery Benchmark

Updated 26 December 2025

DeliveryBench is a comprehensive benchmark ecosystem featuring realistic simulations, industrial-scale datasets, and rigorous evaluation protocols for multi-modal delivery challenges.
It formalizes diverse delivery optimization problems—embodied delivery, dynamic pickup, and last-mile routing—with explicit constraints, mathematical frameworks, and evaluation metrics.
The platform supports reproducible research through high-fidelity simulators, standardized benchmarks, and algorithmic baselines that bridge theoretical optimization with real-world logistics.

DeliveryBench denotes a rigorous, multi-dimensional benchmark ecosystem for evaluating algorithms and embodied agents in the context of real-world delivery logistics. It encompasses a spectrum ranging from classic static and dynamic vehicle routing to last-mile and embodied city-scale delivery, with strong emphasis on realistic constraints, data fidelity, and reproducibility. Major DeliveryBench resources span industrial-scale datasets, interactive high-fidelity simulators, formal task specifications, and standardized evaluation protocols.

1. Formal Problem Domains and Mathematical Structure

DeliveryBench implementations instantiate diverse delivery optimization problems, united by precise mathematical frameworks.

City-Scale Embodied Delivery (Food Delivery Benchmark)

The agent's state comprises:
- Physical pose and current vehicle (walking, e-scooter, car, or bus)
- Resource levels (stamina $m_t$, battery $b_t$, cash $w_t$)
- Active order pool (statuses and deadlines) with associated food-attribute tensors ($T$, fragility, odor)
- Global map structure (road network, POIs)

The action space $\mathcal{A}$ includes:
- Navigation actions (high-level MOVE_TO, low-level step/turn)
- Order manipulation (view, accept, pickup, deliver)
- Resource and social actions (rest, recharge, buy, rent, help)

The principal objective is profit maximization subject to domain constraints:

$\pi^\star \in \arg\max_{\pi\in\Pi_\mathcal{C}} \mathbb{E}_\pi[E - C]$

where $E$ is aggregate earnings (including delivery and rating-based bonuses), $C$ is total costs, and $\Pi_\mathcal{C}$ is the set of constraint-feasible policies (Mao et al., 22 Dec 2025).
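
As a rough illustration of the objective above, the following sketch estimates $\mathbb{E}_\pi[E - C]$ by averaging over recorded episodes and picks the best policy among those that never break a constraint. The `Episode` record, its field names, and the policy names are hypothetical and are not part of the benchmark's API; treating any single violation as making a policy infeasible is also a simplifying assumption.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """One simulated shift (hypothetical record, not the benchmark's schema)."""
    delivery_pay: float   # earnings from completed deliveries
    rating_bonus: float   # rating-based bonus, also part of E
    costs: float          # C: energy, rentals, purchases, penalties, ...
    violated: bool        # True if any domain constraint was broken


def net_profit(ep: Episode) -> float:
    """E - C for a single episode."""
    return ep.delivery_pay + ep.rating_bonus - ep.costs


def expected_profit(episodes: list[Episode]) -> float:
    """Monte Carlo estimate of E_pi[E - C]."""
    return mean(net_profit(ep) for ep in episodes)


def best_feasible_policy(rollouts: dict[str, list[Episode]]) -> str:
    """arg max over policies whose rollouts contain no constraint violation."""
    feasible = {
        name: eps for name, eps in rollouts.items()
        if not any(ep.violated for ep in eps)
    }
    return max(feasible, key=lambda name: expected_profit(feasible[name]))


rollouts = {
    "greedy":  [Episode(30, 5, 5, False), Episode(20, 5, 5, False)],
    "risky":   [Episode(120, 0, 20, True)],   # excluded: breaks a constraint
    "careful": [Episode(30, 3, 5, False), Episode(28, 3, 5, False)],
}
print(best_feasible_policy(rollouts))
```

Restricting the maximization to violation-free policies mirrors the role of $\Pi_\mathcal{C}$; a softer variant could instead charge violations as part of $C$.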

Dynamic Pickup and Delivery (DPDP)

DPDP is formulated over a directed network $G=(F,A)$ of factories, with orders $O = \{o_1,\ldots,o_N\}$ arriving in real time. Each order $o_i = (F_{p_i}, F_{d_i}, q_i, t_{e_i}, t_{l_i})$ specifies a pickup factory, a delivery factory, a quantity, and a time window $[t_{e_i}, t_{l_i}]$. A fleet $V$ of $K$ vehicles (capacity $Q$) carries out the assignments $\Pi_k$. Key constraints include order fulfillment, vehicle capacity, strict time windows, driver shifts, LIFO loading, and dock waiting. The objective minimizes $f = \lambda f_1 + f_2$, where $f_1$ is the total lateness, $f_2$ is the average distance traveled per vehicle, and $\lambda\gg1$ (Hao et al., 2022).
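
The weighted score $f = \lambda f_1 + f_2$ can be sketched as follows. This is a minimal illustration, not the competition's scoring code: the function name, the dictionary layout, the units, and the default `lam` value are all assumptions chosen only to show that a large $\lambda$ makes lateness dominate distance.

```python
def dpdp_score(arrival, deadline, vehicle_distance, lam=10_000.0):
    """Return lam * f1 + f2 for one completed schedule.

    arrival, deadline: dicts mapping order id -> time (same unit, assumed).
    vehicle_distance:  list with the distance driven by each of the K vehicles.
    lam:               hypothetical weight; the source only states lam >> 1.
    """
    # f1: total lateness, counting only orders delivered after t_l
    f1 = sum(max(0.0, arrival[o] - deadline[o]) for o in deadline)
    # f2: average distance per vehicle
    f2 = sum(vehicle_distance) / len(vehicle_distance)
    return lam * f1 + f2


arrival = {"o1": 10.0, "o2": 12.0}
deadline = {"o1": 11.0, "o2": 11.5}     # o2 is 0.5 time units late
print(dpdp_score(arrival, deadline, [100.0, 50.0]))  # 5075.0
```

With these numbers the half unit of lateness contributes 5000 to the score while the average distance contributes only 75, which is the intended effect of $\lambda\gg1$.
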

Planar VRP and Multi-objective Benchmarks

In the classic static VRPBench scenario, the domain is a planar graph $G=(V,E)$ with depot $\pi$, customers $C=V\setminus\{\pi\}$, and $k$ vehicles. The routes $R_v$ are chosen to minimize the total route length $f(R) = \sum_{v=1}^k W(R_v)$, where $W(R_v)$ is the length of vehicle $v$'s route, or an alternative objective such as the number of routes or the load variance. VRPBench supports augmentation with time windows and capacities, and lexicographic or Pareto multi-objective optimization (Zeni et al., 2016).
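
The sketch below evaluates $f(R) = \sum_{v=1}^k W(R_v)$ and then contrasts the Pareto and lexicographic treatments of two objectives. It assumes straight-line (Euclidean) distances between points, which simplifies the street-graph distances a planar instance would actually use; all function names and the candidate values are illustrative.

```python
import math


def route_length(depot, stops):
    """W(R_v): depot -> stops in order -> depot, straight-line distances."""
    path = [depot, *stops, depot]
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def total_length(depot, routes):
    """f(R): sum of W(R_v) over all vehicle routes."""
    return sum(route_length(depot, r) for r in routes)


def dominates(a, b):
    """a is no worse than b in every objective and better in one (minimizing)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


depot = (0.0, 0.0)
routes = [[(3.0, 4.0)], [(0.0, 5.0), (0.0, 10.0)]]
print(total_length(depot, routes))  # 30.0

# Each candidate solution is (total length, vehicles used); values are made up.
candidates = [(100.0, 3), (90.0, 4), (110.0, 3), (90.0, 5)]
print(pareto_front(candidates))     # [(100.0, 3), (90.0, 4)]

# Lexicographic alternative: fewest vehicles first, then shortest length.
print(min(candidates, key=lambda c: (c[1], c[0])))  # (100.0, 3)
```

The Pareto view returns every non-dominated trade-off, whereas the lexicographic view commits to a priority order and returns a single solution.
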

Last-Mile Delivery (LaDe)

Instance data is structured over spatio-temporal events with courier trajectories, package locations, time windows, and city/region segmentation (graph $\mathcal{G}$). Formal definitions emphasize task sequences, time-window constraints, and region-level flow (Wu et al., 2023).

2. Datasets, Simulators, and Synthetic Environments

DeliveryBench encompasses a range of dataset and simulator resources, each targeting distinct facets of the delivery problem:

Domain	Benchmark/Dataset	Scale	Format/Simulator	Key Features
Embodied Delivery	DeliveryBench	9 cities	Unreal Engine sim	Long-horizon city-scale, rich agent actions
DPDP	Huawei DPDP Comp.	2k–5k jobs	JSON/event sim	Dynamic, dock/shift, time-windows, LIFO
Planar VRP	VRPBench	1k–10k pts	2D graph tools	Real geography, multi-objective, visualization
Last-mile	LaDe	10.7M pkgs	Annotated CSV/JSON	Courier GPS, five cities, package events

DeliveryBench utilizes a high-fidelity, Unreal Engine–powered simulator for 3D city environments, integrating procedural map generation, resource-constrained mobility, food quality physics, and social/teamwork actions (Mao et al., 22 Dec 2025).
The DPDP benchmark runs a discrete-event simulator built on empirical factory networks, enforcing a strict time limit (600 s per optimization tick) and multi-pallet handling (Hao et al., 2022).
VRPBench provides planar city-graph extraction tools, density-driven sampling, and GUI solution overlays, producing instances up to 10,000 nodes, far exceeding typical TSPLib/VRP test sizes (Zeni et al., 2016).
LaDe is a real-world courier dataset with >10M packages, 21k couriers, and second-level spatiotemporal logs from multiple cities, standardizing last-mile task modeling (Wu et al., 2023).

3. Task Definitions, Constraints, and Evaluation Protocols

DeliveryBench distinguishes itself by formalizing a suite of demanding real-world tasks with explicit constraints and evaluation regimes:

Embodied Delivery
- Task: Maximize net profit over episodic agent deployment, subject to delivery deadlines, food quality, stamina, battery, economic trade-offs, sequencing, and social constraints.
- Evaluation: Hourly net profit, on-time rate, planning and resource metrics, violation fractions, food/customer satisfaction, stratified by city size and agent type (Mao et al., 22 Dec 2025).
Dynamic Pickup and Delivery
- Tasks: Online dispatch, vehicle scheduling with full respect for capacity, time-windows, shift and dock availability, LIFO constraints.
- Evaluation: Primary score $f = \lambda f_1 + f_2$ , plus service rate, average/maximum lateness, computational resource tracking (Hao et al., 2022).
Static/Planar VRP
- Task: Solve for cost-optimal k-vehicle routes under given customer distributions, with extensions for capacity and time-window.
- Evaluation: Total/average route length, vehicle count, fairness, solution feasibility, with visualization and potential lower-bound gap reporting (Zeni et al., 2016).
Last-Mile Datasets
- Tasks: Route prediction (sequence learning), ETA prediction, spatio-temporal demand forecasting (region graph).
- Evaluation: HR@k, Kendall’s $\tau$ , LMD, edit distance for routing; MAE, RMSE, ACC@ $\Delta$ for ETA; MAE/RMSE for forecasting, stratified by city, scenario, and AOI (Wu et al., 2023).
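
For the last-mile metrics listed above, the following sketch shows one plausible reading of HR@k, Kendall's $\tau$, MAE, RMSE, and ACC@$\Delta$. The exact definitions used by the LaDe evaluation scripts may differ (for example in tie handling or in how HR@k is normalized), so the formulas and function names here are assumptions for illustration.

```python
import math
from itertools import combinations


def hit_rate_at_k(pred, actual, k):
    """Share of the first k predicted stops that are among the first k actual stops."""
    return len(set(pred[:k]) & set(actual[:k])) / k


def kendall_tau(pred, actual):
    """Rank correlation of two orderings of the same stops (no ties assumed)."""
    pos = {stop: i for i, stop in enumerate(pred)}
    pairs = list(combinations(actual, 2))   # (a, b) with a before b in actual
    concordant = sum(pos[a] < pos[b] for a, b in pairs)
    return (2 * concordant - len(pairs)) / len(pairs)


def mae(pred, actual):
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


def rmse(pred, actual):
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))


def acc_within(pred, actual, delta):
    """ACC@delta: share of predictions whose absolute error is at most delta."""
    return sum(abs(p - a) <= delta for p, a in zip(pred, actual)) / len(actual)


pred_route, true_route = ["a", "b", "c", "d"], ["a", "c", "b", "d"]
print(hit_rate_at_k(pred_route, true_route, 2))   # 0.5
print(kendall_tau(pred_route, true_route))        # one swapped pair out of six

pred_eta, true_eta = [10.0, 20.0, 35.0], [12.0, 20.0, 30.0]   # minutes, made up
print(mae(pred_eta, true_eta), rmse(pred_eta, true_eta), acc_within(pred_eta, true_eta, 3.0))
```

Route metrics compare orderings of the same set of stops, while the ETA metrics compare numeric predictions element by element.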

4. Baselines, Algorithmic Approaches, and Leaderboards

DeliveryBench supports meaningful comparison by providing baseline implementations, reproducible protocols, and (often) public leaderboards:

Embodied Agent Baselines
- VLMs: GPT-5, GPT-4o, Claude-3.7, Gemini, and open-source Qwen and LLaMA variants.
- Humans serve as empirical upper baselines: in DeliveryBench, humans earn \$50–63/hr versus \$31/hr for the best LLM in small/medium cities.
- Planning ablations and context engineering show substantial gains (e.g., explicit plan annotation or a “dynamic cheatsheet” raises performance).
- Multi-agent settings: Small teams perform best; performance drops sharply in large teams because coordination is brittle (Mao et al., 22 Dec 2025).
DPDP Competition
- Winner: Variable Neighborhood Search; the runners-up rely on heuristics augmented for dynamic constraints.
- RL/DRL approaches trail well-tuned combinatorial and threshold-based heuristics due to real-time and scalability limits.
- Runtimes and real-time compliance are strictly enforced within the evaluation loop (Hao et al., 2022).
Planar VRP
- Instances vastly exceed traditional benchmarks (10k vs TSPLib/VRPTW max 1k).
- Algorithms can optimize for distance, vehicle count, or fairness. Lower bounds are used when available for measuring solution quality (Zeni et al., 2016).
LaDe Regional Models
- Classical (TimeGreedy, DistanceGreedy, Or-Tools), ML (OSquare, Graph2Route, FDNET), and GNN-based models (ASTGCN, DCRNN).
- Example: Graph2Route achieves HR@3 ≈ 71.7%, outperforming heuristic baselines in last-mile route prediction (Wu et al., 2023).
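
To make the classical route-prediction baselines concrete, here is a minimal sketch of what heuristics named DistanceGreedy and TimeGreedy typically do: the first repeatedly visits the nearest unvisited task, the second orders tasks by deadline. This interpretation of the names, the straight-line distance, and the data layout are assumptions, not the reference implementations.

```python
import math


def distance_greedy(start, tasks):
    """Visit the nearest remaining task next.

    start: (x, y) of the courier; tasks: dict task id -> (x, y).
    """
    remaining = dict(tasks)          # copy so the caller's dict is untouched
    here, order = start, []
    while remaining:
        nxt = min(remaining, key=lambda t: math.dist(here, remaining[t]))
        order.append(nxt)
        here = remaining.pop(nxt)    # move the courier to the chosen task
    return order


def time_greedy(deadlines):
    """Visit tasks in order of increasing deadline (dict task id -> time)."""
    return sorted(deadlines, key=deadlines.get)


tasks = {"a": (5.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0)}
print(distance_greedy((0.0, 0.0), tasks))             # ['b', 'c', 'a']
print(time_greedy({"a": 30, "b": 10, "c": 20}))       # ['b', 'c', 'a']
```

Both heuristics ignore learned courier behavior, which is the gap the ML and GNN-based models aim to close.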

5. Domain-Specific Constraints and Realism

DeliveryBench environments and datasets are engineered for high fidelity to operational delivery constraints:

Resource Coupling: All major environments enforce stamina/battery/cash/vehicle/routing interactions; depletions yield forced halts or cost penalties.
Food Quality Modeling: Discrete modeling of thermal drift, fragility (damage by vibration), and odor transfer introduces cross-order dependencies not found in classic VRP/OR benchmarks.
Social/Competition Dynamics: Multi-agent competition for order pools, charging stalls, and explicit cooperative help actions are implemented in embodied simulators.
Dock and Shift Realism: In DPDP, vehicles must respect factory dock capacities, FCFS service, queueing, and work shift intervals. LIFO constraints on pallet loading restrict feasible delivery sequences.
Density and Heterogeneity: Planar VRPBench leverages empirical street-class, type, and land-use-based density factors for realistic customer distributions (Zeni et al., 2016). LaDe spans five heterogeneous cities with distinct demand/supply patterns (Wu et al., 2023).
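
The LIFO loading rule mentioned above for DPDP can be checked with a stack: an order may only be unloaded if it was the most recently loaded order still on the vehicle. The sketch below assumes a simplified event format of `(action, order_id)` pairs, which is a hypothetical layout rather than the benchmark's schedule format.

```python
def respects_lifo(events):
    """Return True if every delivery unloads the most recently loaded order.

    events: iterable of ("pickup" | "deliver", order_id) in execution order.
    """
    stack = []                                   # orders currently on the vehicle
    for action, order_id in events:
        if action == "pickup":
            stack.append(order_id)
        elif action == "deliver":
            if not stack or stack[-1] != order_id:
                return False                     # order is buried or not loaded
            stack.pop()
        else:
            raise ValueError(f"unknown action: {action}")
    return True


ok = [("pickup", 1), ("pickup", 2), ("deliver", 2), ("deliver", 1)]
bad = [("pickup", 1), ("pickup", 2), ("deliver", 1), ("deliver", 2)]
print(respects_lifo(ok), respects_lifo(bad))     # True False
```

In the second sequence order 1 sits beneath order 2, so delivering it first is infeasible; this is how LIFO loading restricts the set of valid delivery sequences.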

6. Extensions, Adaptability, and Comparative Strengths

DeliveryBench supports robust extensibility:

Scenario Expansion: Embodied simulation, classic VRP, last-mile, and DPDP frameworks enable research on static/dynamic routing, multi-objective optimization, teamwork, and stochastic/demand-forecasting tasks.
Adapting to New Contexts: VRPBench provides a documented pipeline for generating instances from arbitrary city street networks, supporting adaptation via GIS/OSM imports and land-use-based density modeling (Zeni et al., 2016).
Open Toolchains: DeliveryBench releases code, data, and configuration schemas for integration and public evaluation.
Comparative Strengths: Relative to TSPLib, Solomon, and CVRPLib, DeliveryBench instances offer order-of-magnitude scaling, empirical geographical heterogeneity, multi-objective support, and context-rich simulation (Zeni et al., 2016).

7. Impact, Current Limitations, and Future Directions

DeliveryBench has created a unified platform for evaluating logistics and embodied reasoning in realistic, constraint-dense settings.

Revealed Gaps: Current agents (notably VLM-based) underperform humans, especially in parallel task management and resource foresight; they are prone to commonsense violations (e.g., letting food items spoil) and have brittle multi-agent coordination (Mao et al., 22 Dec 2025).
Algorithmic Insights: Heuristic and hybrid methods remain competitive under dynamic constraints and real-time budgets; pure RL/DRL is not yet dominating at operational scales (Hao et al., 2022).
Forward Trajectory: Future efforts will target real-time embodied reasoning, large-scale RL/imitation training, richer social/negotiation mechanisms, and removal of privileged spatial modal priors to test generalization (Mao et al., 22 Dec 2025).

DeliveryBench serves as a comprehensive, reproducible, and practical foundation for advancing research in multi-faceted delivery domains, closing the gap between theoretical optimization and deployment in complex, real-world environments (Mao et al., 22 Dec 2025, Wu et al., 2023, Zeni et al., 2016, Hao et al., 2022).

References

Mao et al. (2025). DeliveryBench: Can Agents Earn Profit in Real World?

Hao et al. (2022). Introduction to The Dynamic Pickup and Delivery Problem Benchmark -- ICAPS 2021 Competition.

Zeni et al. (2016). VRPBench: A Vehicle Routing Benchmark Tool.

Wu et al. (2023). LaDe: The First Comprehensive Last-mile Delivery Dataset from Industry.
