Real-World mRAG Planning Benchmark

Updated 14 August 2025

Real-World mRAG Planning Benchmark is a framework that rigorously evaluates multimodal retrieval-augmented generation systems and planning agents in complex, dynamic scenarios.
The benchmark employs a three-phase planning model—individual plan generation, best-response coordination, and detailed timetabling—to assess scalability, cost improvement, and feasibility.
By integrating real-world transportation data, it quantifies trade-offs between cost efficiency, journey prolongation, and timetable feasibility, offering actionable insights for system scalability and user welfare.

Real-World mRAG Planning (RemPlan) Benchmark denotes a class of benchmarks intended to rigorously evaluate multimodal retrieval-augmented generation (mRAG) systems and planning agents in complex, dynamic, real-world scenarios. Such benchmarks assess the integration of multi-agent planning, retrieval, and adaptive decision-making over heterogeneous data, commonly in domains where scalability, coordination, and trade-off optimization are critical (e.g., travel sharing, logistics, and collaborative agent problem solving). Core benchmark features include scenario complexity, evaluation criteria capturing solution quality and feasibility, environmental realism, and engineering scalability.

1. Multiagent Planning Principles and Algorithmic Frameworks

RemPlan-style benchmarks assess systems that generate coordinated plans or solutions for multiple agents, each with individual preferences, under real-world constraints. A canonical example is the strategic multiagent travel sharing system, in which the planning process is formally decomposed into three phases:

Individual Plan Generation (Initial Phase): Each agent independently solves for an optimal path in a “relaxed domain,” typically represented as a directed graph $G = (V, E)$ with stops as nodes and minimal travel-time connections as edges. Off-the-shelf single-agent planners (e.g., LAMA) compute shortest solo journeys.
Best-Response Multiagent Coordination (Best-Response Phase): Plans are merged and iteratively improved using a best-response planning (BRP) approach. Given a joint plan $\pi^{(k)}$, each agent $i$ in turn computes:

$\pi^{(k+1)} = \arg\min_{\pi} \{ C_i(\pi) \mid \pi_j = \pi_j^{(k)} \text{ for all } j \ne i \}$

where $\pi_j$ is agent $j$’s part of the joint plan and $C_i(\pi)$ is the personalized cost for agent $i$. Costs are dynamically discounted for shared segments:

$c_{i, n} = \left(\frac{0.8}{n} + 0.2\right) \cdot c_i$

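where $c_i$ is the undiscounted cost of a segment for agent $i$ and $n$ is the number of agents sharing that segment, counting agent $i$ itself, so that $n = 1$ recovers the solo cost $c_i$. The process converges to an individually rational Nash equilibrium (no agent benefits from unilateral deviation) under standard domain assumptions.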

Full-Domain Timetabling (Timetabling Phase): The abstract joint plan is mapped to a fine-grained temporal planning domain, with the agents partitioned into independent groups of co-travelers. Each group’s sequential, agent-consistent segments (“parts”) are planned using temporal single-agent planners (e.g., SGP, POPF2), incorporating real-world timetable constraints and synchronizing departures, arrivals, and service availability.
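The individual plan generation step can be illustrated with a small sketch. The source describes off-the-shelf planners such as LAMA for this phase; the code below swaps in plain Dijkstra search as a stand-in, and the function name `shortest_journey`, the adjacency-list format, and the toy stops and travel times are all assumptions made for illustration.

```python
import heapq

def shortest_journey(graph, origin, destination):
    """Dijkstra search over a relaxed graph.

    graph maps a stop to a list of (next_stop, minimal_travel_minutes).
    Returns (total_minutes, [stops...]) or None if unreachable.
    """
    best = {origin: 0}
    prev = {}
    queue = [(0, origin)]
    while queue:
        dist, stop = heapq.heappop(queue)
        if stop == destination:
            break
        if dist > best.get(stop, float("inf")):
            continue  # stale queue entry
        for nxt, minutes in graph.get(stop, []):
            cand = dist + minutes
            if cand < best.get(nxt, float("inf")):
                best[nxt] = cand
                prev[nxt] = stop
                heapq.heappush(queue, (cand, nxt))
    if destination not in best:
        return None
    # Walk predecessors back from the destination to rebuild the path.
    path = [destination]
    while path[-1] != origin:
        path.append(prev[path[-1]])
    return best[destination], path[::-1]

# Hypothetical relaxed domain: stops A, C, X, B with minimal travel times.
relaxed = {
    "A": [("X", 4), ("B", 11)],
    "C": [("X", 3), ("B", 8)],
    "X": [("B", 6)],
}
print(shortest_journey(relaxed, "A", "B"))  # -> (10, ['A', 'X', 'B'])
print(shortest_journey(relaxed, "C", "B"))  # -> (8, ['C', 'B'])
```

Each agent's search is independent of the others, which is what makes this phase straightforward to run in parallel.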

The engineering design is characterized by the modular separation of concerns: initial (individual and best-response) planning phases are domain-independent and parallelizable, while only the timetable mapping phase embeds real-world, domain-specific constraints (Hrnčíř et al., 2013).
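A minimal sketch of the best-response phase follows, under strong simplifying assumptions: each agent chooses among a small fixed set of candidate routes instead of invoking a planner, and a segment's cost depends only on how many agents use it. The discount $(0.8/n + 0.2)$ is taken from the formula above; the names `shared_cost`, `agent_cost`, `best_response`, and the toy routes and costs are hypothetical.

```python
from collections import Counter

def shared_cost(base_cost, n):
    """Discounted segment cost when n agents (n >= 1) share the segment."""
    return (0.8 / n + 0.2) * base_cost

def agent_cost(i, choice, routes, seg_cost):
    """Cost of agent i when agent j takes route routes[j][choice[j]]."""
    load = Counter(seg for j, r in enumerate(choice) for seg in routes[j][r])
    return sum(shared_cost(seg_cost[s], load[s]) for s in routes[i][choice[i]])

def best_response(routes, seg_cost, max_rounds=100):
    # Initial phase: every agent picks its cheapest solo route.
    choice = [
        min(range(len(rs)), key=lambda r, rs=rs: sum(seg_cost[s] for s in rs[r]))
        for rs in routes
    ]
    for _ in range(max_rounds):
        changed = False
        for i in range(len(routes)):
            def cost_if(r):
                trial = choice[:i] + [r] + choice[i + 1:]
                return agent_cost(i, trial, routes, seg_cost)
            best = min(range(len(routes[i])), key=cost_if)
            # Switch only on strict improvement, holding other agents fixed.
            if cost_if(best) < cost_if(choice[i]) - 1e-12:
                choice[i] = best
                changed = True
        if not changed:
            break  # no agent can improve unilaterally
    return choice

# Toy instance: two agents, each with a direct route and a route via X.
seg_cost = {"A-B": 11, "A-X": 4, "X-B": 6, "C-B": 8, "C-X": 3}
routes = [
    [["A-B"], ["A-X", "X-B"]],  # agent 0
    [["C-B"], ["C-X", "X-B"]],  # agent 1
]
print(best_response(routes, seg_cost))  # -> [1, 1]
```

In this toy instance agent 1 starts on its direct route (cost 8), then switches to the route via X once agent 0 is already on the shared segment X-B, because the discounted cost of that segment makes the detour cheaper. The loop stops when no agent can lower its own cost by changing route alone.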

2. Evaluation Metrics and Trade-off Quantification

RemPlan benchmarks measure system performance along multiple dimensions to capture both computational efficiency and solution quality:

Scalability: Reported experiments show computation time scaling linearly with both the number of agents and the size of the scenario or domain, avoiding the exponential blowup typical of naïve joint planning.
Total Cost Improvement: The benefit of multiagent collaboration is quantified as

$\Delta C = \frac{\sum_i C_i(\pi_i) - \sum_i C'_i(\pi_N)}{\sum_i C_i(\pi_i)}$

where $C_i(\pi_i)$ is agent $i$’s solo cost and $C'_i(\pi_N)$ is the same agent’s cost in the shared plan $\pi_N$, including group discounts.

Feasible Solution Rate: The fraction of agent groups for which executable (time-consistent) schedules are found after mapping to the full domain.
Journey Prolongation: The proportional increase in journey duration for shared plans versus solo journeys.
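The metrics above reduce to a few lines of arithmetic. The sketch below assumes per-agent costs and per-group durations are already available as plain numbers; the function names and the sample values are illustrative and not taken from the cited experiments.

```python
def cost_improvement(solo_costs, shared_costs):
    """Relative total cost improvement: (sum solo - sum shared) / sum solo."""
    total_solo = sum(solo_costs)
    return (total_solo - sum(shared_costs)) / total_solo

def feasible_rate(feasible_flags):
    """Fraction of groups for which a time-consistent schedule was found."""
    return sum(feasible_flags) / len(feasible_flags)

def journey_prolongation(solo_minutes, shared_minutes):
    """Proportional increase in duration of the shared journey over the solo one."""
    return (shared_minutes - solo_minutes) / solo_minutes

# Illustrative values only.
print(round(cost_improvement([10, 8], [7.6, 6.6]), 3))
print(feasible_rate([True, True, False, True]))
print(journey_prolongation(40, 50))
```

Reporting all three together matters because each can be improved at the expense of the others, for example by forcing larger groups.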

This multi-metric analysis highlights critical trade-offs: increasing group size reduces per-person cost but decreases timetable feasibility and increases journey duration. Optimal group sizes (typically up to 4–5 agents) balance these competing objectives (Hrnčíř et al., 2013).

3. Real-World Data Integration and Application

RemPlan benchmarks are grounded in real-world, multimodal datasets, necessitating extensive preprocessing:

Domain Construction: Raw transportation datasets (e.g., UK NPTDR and NaPTAN) are transformed via spatially enabled databases, deduplicated, and augmented (walking links, corrected stop mappings).
Dual Abstraction: Domains are analyzed at two granularities—a relaxed graph for tractable, optimistic planning and a full temporal graph for final validation.
Complexity and Constraints: The real-world domain may contain hundreds of thousands of active connections, and feasible timetabling must respect precisely matched service times, making large-scale real-world scheduling nontrivial.
Engineering Insights: Empirical results demonstrate modular, layered designs can yield solutions for nontrivial group sizes. When over-constrained, larger groups may be dynamically repartitioned to salvage shared savings (Hrnčíř et al., 2013).
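The dual abstraction can be sketched as a reduction from timetabled connections to a relaxed graph that keeps only the minimal travel time between each pair of stops. The tuple layout `(origin, destination, depart, arrive)` with times in minutes is an assumption for illustration and does not reflect the actual NPTDR or NaPTAN schema.

```python
def build_relaxed_graph(connections):
    """Collapse timetabled connections into minimal travel-time edges.

    connections: iterable of (origin, destination, depart_min, arrive_min).
    Returns {origin: [(destination, minimal_minutes), ...]}.
    """
    fastest = {}
    for origin, dest, depart, arrive in connections:
        duration = arrive - depart
        if duration < 0:
            continue  # skip inconsistent records
        key = (origin, dest)
        if duration < fastest.get(key, float("inf")):
            fastest[key] = duration
    graph = {}
    for (origin, dest), duration in fastest.items():
        graph.setdefault(origin, []).append((dest, duration))
    return graph

# Hypothetical timetable rows: two A->X services and one X->B service.
timetable = [
    ("A", "X", 480, 485),
    ("A", "X", 500, 504),
    ("X", "B", 510, 516),
]
print(build_relaxed_graph(timetable))  # -> {'A': [('X', 4)], 'X': [('B', 6)]}
```

The relaxed graph is optimistic by construction: it ignores waiting times and service availability, so a journey found on it is a lower bound that still has to be validated against the full timetable.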

4. Scalability, Parallelism, and System Architecture

Effective RemPlan approaches emphasize decomposition to maintain scalability:

Phase Decoupling: By isolating the initial and best-response phases (which are predominantly parallelizable single-agent or small-group computations), the framework avoids the exponential state-space growth of planning jointly over all agents.
Timetabling Parallelization: Independent travel groups are planned in parallel, maximizing resource utilization.
Guidance for Engineering: Layered, domain-independent modules accommodate adaptation to various real-world planning domains (e.g., logistics, network routing).
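Because travel groups are independent, the timetabling phase can be dispatched group by group. The sketch below uses Python's standard `concurrent.futures` thread pool; `plan_group` stands for whatever temporal planner is called per group, and `stub_planner` is a hypothetical placeholder that performs no real planning.

```python
from concurrent.futures import ThreadPoolExecutor

def timetable_groups(groups, plan_group, max_workers=4):
    """Plan every independent group concurrently.

    Results are returned in the same order as `groups`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(plan_group, groups))

def stub_planner(group):
    # Placeholder: a real implementation would invoke a temporal planner here.
    return {"agents": tuple(group), "size": len(group)}

groups = [["a1", "a2"], ["a3"], ["a4", "a5", "a6"]]
for result in timetable_groups(groups, stub_planner):
    print(result)
```

Threads are sufficient when each call mostly waits on an external planner process; a CPU-bound in-process planner would be a reason to use a process pool instead.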

The result is a modular planning architecture capable of handling domains with high data volume, agent heterogeneity, and tight temporal constraints with predictable computational cost scaling (Hrnčíř et al., 2013).

5. Environmental and User-Centric Objectives

A distinguishing property of RemPlan is explicit joint optimization for system-wide and individual gains:

Environmental Impact: By maximizing shared travel segments, benchmarks model scenarios that minimize net vehicle usage and increase public transport capacity utilization, yielding measurable reductions in emissions, congestion, and network strain.
User Welfare: Shared planning delivers individually rational solutions, often with direct cost benefits (discounted tickets, lower per-person cost), and may offer secondary social or convenience advantages with only a controlled increase in journey duration (usually <30% for a substantial portion of groups).
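Trade-off Management: Two constraints bound the trade-off. The per-agent cost reduction from sharing is capped, because the discounted segment cost never falls below $0.2 \cdot c_i$, and an agent’s benefit is always nonnegative, because no agent moves to a shared plan that costs it more than its solo plan. Together these uphold incentive compatibility (Hrnčíř et al., 2013).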
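The cap on savings follows directly from the discount formula in Section 1. Assuming $n \ge 1$ counts every agent on the segment, including agent $i$, the fraction saved on a shared segment is

$1 - \frac{c_{i,n}}{c_i} = 1 - \frac{0.8}{n} - 0.2 = 0.8\left(1 - \frac{1}{n}\right)$

This gives no saving for $n = 1$, 40% for $n = 2$, 60% for $n = 4$, and 64% for $n = 5$. The saving approaches but never reaches 80% as $n$ grows, so each additional co-traveler contributes less than the previous one.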

6. Practical Implementation and Engineering Process

The translation of RemPlan algorithms from theory to robust practice depends on several engineering steps:

Data Preprocessing: Raw XML sources are imported, matched, and filtered via spatial SQL databases, resolving duplications and enriching connectivity.
Domain Modeling: Planning representations are constructed as PDDL models for both the relaxed and full temporal domains, ensuring compatibility with standard AI planning toolkits.
Module Design: Initial plan generation and best-response modules are developed to interface with off-the-shelf planners, maintaining domain independence and testability.
Algorithm Integration: The system progresses from solo plan generation, to Nash-equilibrium-based best-response steps, and then to timetable-mapping with temporal planners, iterating as necessary for feasibility.
Performance Considerations: Emphasis on parallelism at independent stages further enhances throughput and scalability, directly enabling deployment in operational settings.
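The timetable-mapping step can be illustrated with a deliberately simple feasibility check. The source uses temporal planners (e.g., SGP, POPF2) for this; the sketch below instead fixes the sequence of stops and greedily takes, on each leg, the service that arrives earliest among those departing after the previous arrival. The function name `schedule_route` and the connection tuple layout are assumptions, and transfer times are ignored.

```python
def schedule_route(stops, connections, earliest_start):
    """Map an abstract route onto timetabled services.

    stops: ordered list of stops from the relaxed plan.
    connections: iterable of (origin, destination, depart_min, arrive_min).
    Returns a list of chosen legs, or None if no time-consistent schedule exists.
    """
    time = earliest_start
    legs = []
    for origin, dest in zip(stops, stops[1:]):
        options = [
            (arrive, depart)
            for o, d, depart, arrive in connections
            if o == origin and d == dest and depart >= time
        ]
        if not options:
            return None  # plan is infeasible in the full domain
        arrive, depart = min(options)  # earliest arrival on this leg
        legs.append((origin, dest, depart, arrive))
        time = arrive
    return legs

timetable = [
    ("A", "X", 480, 485),
    ("A", "X", 500, 504),
    ("X", "B", 483, 489),
    ("X", "B", 510, 516),
]
print(schedule_route(["A", "X", "B"], timetable, 475))
# -> [('A', 'X', 480, 485), ('X', 'B', 510, 516)]
print(schedule_route(["A", "X", "B"], timetable, 505))  # -> None
```

The first call shows why relaxed plans can be optimistic: the relaxed journey takes 10 minutes, while the scheduled one spans 36 minutes because the 483 departure from X cannot be reached. A `None` result corresponds to a group that counts against the feasible solution rate.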

This careful separation of phases, combined with explicit cost-sharing formulas and robust timetable mapping, enables RemPlan-style benchmarks to serve as templates for deployment of multiagent and multi-modal planning systems in real-world, high-stakes environments (Hrnčíř et al., 2013).

7. Implications, Limitations, and Future Directions

RemPlan benchmarks provide rigorous, quantitative foundations for the evaluation and deployment of multiagent planning systems with retrieval-augmented, real-world constraints:

Contribution: The strategic decomposition of planning, the use of Nash-equilibrium-based best-response mechanisms, and real-world grounded cost models demonstrate feasible, scalable solutions with meaningful environmental and user value.
Known Limitations: Timetable feasibility falls sharply beyond small group sizes due to combinatorial constraints; trade-offs between cost improvement and journey time must be actively managed. Further, manual engineering is required to extend methodology to variants (e.g., continuous time, multimodal logistics).
Research Frontiers: Potential directions include automated dynamic group splitting, richer agent preference modeling, adaptive cost-sharing mechanisms, and extension to broader domains (including logistics, network routing, and cooperative robotics).

RemPlan thus occupies a central position in the methodological landscape of real-world mRAG and multiagent planning evaluation, providing repeatable, scalable benchmarks and design principles directly transferable to practical collaborative planning systems.


References

Hrnčíř et al. (2013). Applying Strategic Multiagent Planning to Real-World Travel Sharing Problems.
