PARTNR Benchmark Suite
- PARTNR Benchmark is a suite of evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning.
- It encompasses tasks ranging from simulated pick-and-place experiments to large-scale human–robot household collaborations, employing metrics such as success rate and sample efficiency.
- The benchmarks provide practical insights into online interactive learning, robust ambiguity detection, and multi-agent planning, setting clear directions for advancing AI and robotics.
The PARTNR benchmark is a suite of rigorous, publicly documented evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning across robotics, AI, and computational finance. Multiple distinct research efforts have adopted the PARTNR acronym—ranging from tabletop pick-and-place learning (Luijkx et al., 2022), to large-scale human–robot household collaboration (Chang et al., 2024, Li et al., 8 Jul 2025), to risk-management parallel computation (Chancelier et al., 2010)—each with bespoke tasks, formalisms, and experimental protocols. Below, the principal PARTNR benchmarks are detailed along axes of problem definition, experimental setup, evaluation metrics, algorithmic baselines, empirical findings, and open research challenges.
1. Definitions and Problem Formulations
Interactive Pick-and-Place Ambiguity Benchmark (Luijkx et al., 2022)
The PARTNR (Pick and place Ambiguity Resolving by Trustworthy iNteractive leaRning) benchmark is instantiated as a simulated tabletop manipulation task: three colored cubic blocks and three colored cylindrical bowls are placed at random, non-overlapping positions on a single table. The agent is presented, per episode, with three language-based commands of the canonical form: “Pick the [pick_color] block and place it in the [place_color] bowl.” Colors are drawn from both in-distribution (seen) and out-of-distribution (unseen) sets to induce domain shift.
The agent must select discrete pixel coordinates for the pick and place locations in sequence, conditioned on top-down RGB imagery and the command (embedded via a CLIP encoder). The core challenge is to resolve ambiguity in under-specified visual-linguistic contexts, proactively query for demonstrations, and adapt the pick-and-place policy via interactive learning.
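The action interface can be sketched as selecting the argmax pixel of each predicted Q-value heatmap. This is a minimal illustration, not the paper's implementation; the heatmaps are assumed to come from upstream CNNs conditioned on the RGB image and the CLIP-embedded command, and all names here are illustrative.

```python
import numpy as np

def select_pick_place(q_pick, q_place):
    """Pick the argmax pixel of each Q-value heatmap.

    q_pick, q_place: (H, W) arrays of per-pixel Q-values, assumed to be
    produced upstream by networks conditioned on the top-down RGB image
    and the CLIP-embedded command.
    """
    pick_yx = np.unravel_index(int(np.argmax(q_pick)), q_pick.shape)
    place_yx = np.unravel_index(int(np.argmax(q_place)), q_place.shape)
    return tuple(map(int, pick_yx)), tuple(map(int, place_yx))

# Toy heatmaps with a single peak each.
q_pick = np.zeros((4, 5)); q_pick[2, 3] = 1.0
q_place = np.zeros((4, 5)); q_place[0, 1] = 1.0
print(select_pick_place(q_pick, q_place))  # ((2, 3), (0, 1))
```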
Human–Robot Collaboration and Planning Benchmark (Chang et al., 2024, Li et al., 8 Jul 2025)
The PARTNR benchmark for planning and reasoning in embodied multi-agent tasks targets human–robot teamwork in household environments. An episode is specified by an initial Habitat 3.0 simulator state, a free-form natural-language task instruction (generated by LLMs with in-prompt retrieval), and an evaluation function. Task types include:
- Constraint-free (C) — unconstrained ordering and execution
- Spatial (S) — spatial relations (“next to”, “on top of”)
- Temporal (T) — partial-order constraints
- Heterogeneous (H) — agent-specific abilities (“wash”, “pour”)
Each instruction is decomposed into predicates, with temporal dependencies among them encoded as a directed acyclic graph (DAG). The environment is formalized as a two-agent POMDP over the joint state space of object, agent, and room configurations, with both full and partial observability regimes.
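The DAG of temporal dependencies can be made concrete with a small sketch: given hypothetical predicates and their prerequisites, a candidate execution order is valid only if every predicate appears after all of its prerequisites. The predicate names here are illustrative, not drawn from the benchmark.

```python
# Hypothetical predicate DAG: each predicate maps to the set of
# predicates that must be satisfied before it.
deps = {
    "in(cup, sink)": set(),                # no prerequisites
    "washed(cup)": {"in(cup, sink)"},      # wash only after placing in sink
    "on(cup, shelf)": {"washed(cup)"},     # shelve only once washed
}

def order_is_valid(order, deps):
    """Check that an execution order respects the partial-order DAG."""
    pos = {p: i for i, p in enumerate(order)}
    return all(pos[pre] < pos[p] for p, pres in deps.items() for pre in pres)

print(order_is_valid(["in(cup, sink)", "washed(cup)", "on(cup, shelf)"], deps))  # True
print(order_is_valid(["washed(cup)", "in(cup, sink)", "on(cup, shelf)"], deps))  # False
```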
2. Dataset Construction and Simulation Workflows
Interactive Pick-and-Place (Luijkx et al., 2022)
- Offline Data: Heuristic “expert” demonstrations, 500/1000/1500 trajectories.
- Online Data: Additional demonstrations are requested on-the-fly when the policy is ambiguous (detection based on topological analysis of predicted Q heatmaps).
- Splits: Examples include balanced 50% offline + 50% online, and other mixed-source splits to match the total data budget.
- Modalities: RGB image, language command (CLIP embedding), and ground-truth labels.
- Domain shifts: Evaluated under both color generalization (seen vs. unseen sets) and noisy demonstration conditions (Gaussian pixel-level label noise).
Multi-Agent Planning (Chang et al., 2024, Li et al., 8 Jul 2025)
- Scale: 100,000 training episodes and 1,000 each for validation and test; 60 houses from HSSD; 5,819 object classes; tasks automatically generated and curated via LLMs with simulation-in-the-loop filtering.
- Validation: All episodes undergo human-in-the-loop feasibility checks using tools like PrediViz.
Dataset Statistics Table
| Split | Houses | Episodes | Object Classes |
|---|---|---|---|
| Train | 37 | 100,000 | 5,819 |
| Val | 13 | 1,000 | 5,819 |
| Test | 10 | 1,000 | 5,819 |
3. Core Algorithms and Technical Constructs
Ambiguity Detection and Interactive Learning (Luijkx et al., 2022)
Given an observation, separate CNNs produce Q-value heatmaps for the pick and place actions. Local maxima (modes) in these maps are extracted via persistent homology. A softmax over the modes' raw scores yields per-mode confidences; ambiguity is flagged and a demonstration is requested when the highest mode's normalized confidence falls below a learned threshold.
A sensitivity-controlled gating function adapts this threshold from the empirical true-positive and false-negative rates, computed over a sliding window, so that the query rate tracks a desired sensitivity at a fixed learning rate.
All new samples are aggregated and used to update the policy (DAgger-style).
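The gating logic above can be sketched in a few lines. This is a hedged illustration under simplifying assumptions: the mode scores are taken as precomputed (standing in for the persistent-homology extraction), and the threshold update is a plain proportional rule toward the target sensitivity, not necessarily the paper's exact formula.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def is_ambiguous(mode_scores, threshold):
    """Flag ambiguity when the best mode's softmax confidence falls
    below the current threshold. mode_scores are raw scores of local
    maxima of a Q heatmap (assumed precomputed here)."""
    return max(softmax(mode_scores)) < threshold

def update_threshold(threshold, sensitivity_est, target, lr=0.05):
    """Nudge the gating threshold toward a target sensitivity estimated
    over a sliding window of past queries (illustrative update rule)."""
    return threshold + lr * (target - sensitivity_est)

print(is_ambiguous([5.0, 1.0, 0.5], threshold=0.6))  # one dominant mode: False
print(is_ambiguous([2.0, 1.9, 1.8], threshold=0.6))  # near-tie: True
```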
Human–Robot Multi-Agent Planning (Chang et al., 2024, Li et al., 8 Jul 2025)
- POMDP formalization with agent-specific action sets (e.g., Spot’s abstract navigation and manipulation; humanoid-exclusive “Clean”, “Pour”).
- Tasks parameterized by partial/full observability; joint vs decentralized planners.
- Planning algorithms include centralized and decentralized ReAct LLM-based planners with fine-tuning (LoRA) and retrieval-augmented prompting.
- Evaluation functions recognize compositional, spatial, and temporal goal satisfaction via satisfaction of predicates.
4. Evaluation Protocols and Metrics
Pick-and-Place Benchmark (Luijkx et al., 2022)
- Success rate: correctly executed commands / total commands (300 commands per evaluation run).
- Query statistics: TP, FP, FN, TN—used to compute sensitivity and specificity.
- Domain adaptation speed: Success improvement as online demonstrations are added.
- Sample efficiency: Success as function of demonstration count (500/1000/1500).
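The metrics above reduce to simple ratios over counts; a minimal sketch, with the query classifier's TP/FP/FN/TN counts treated as given:

```python
def query_metrics(tp, fp, fn, tn):
    """Sensitivity and specificity of the ambiguity-query classifier:
    a true positive is a demonstration request made when the policy was
    genuinely ambiguous."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def success_rate(correct, total=300):
    """Fraction of correctly executed commands per evaluation batch."""
    return correct / total

print(query_metrics(40, 10, 10, 240))  # (0.8, 0.96)
print(success_rate(172))               # ~0.573
```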
Key Results Table (Success %)
| Algorithm | Data Split | 500 | 1000 | 1000 Noisy | 1500 |
|---|---|---|---|---|---|
| Baseline | 100% off, seen | 28.3 | 51.7 | 82.7 | 62.7 |
| PARTNR | 50% off + 50% int | 30.3 | 57.3 | 91.0 | 80.3 |
| PARTNR 80% | 50% off + 30% int | 28.0 | 39.3 | 77.7 | 68.0 |
| Baseline | 100% off, unseen | 19.0 | 22.0 | — | 16.7 |
| PARTNR | 50% off + 50% int | 30.7 | 53.0 | — | 78.3 |
Multi-Agent Planning Benchmark (Chang et al., 2024, Li et al., 8 Jul 2025)
- Percent-complete: fraction of an episode's task predicates satisfied at termination.
- Success: binary indicator that all predicates (including ordering constraints) are satisfied, i.e., percent-complete equals 1.
- Efficiency: Simulation steps per episode; LLM planning cycles; extraneous action count; task-offloading ratio.
- Human and human–AI baselines: Steps and completion percentages for solo, dual human, and human–LLM teams.
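Under the reading that percent-complete is the fraction of satisfied predicates and success requires all of them, the two headline metrics reduce to:

```python
def percent_complete(satisfied_flags):
    """Fraction of task predicates the evaluation function marks satisfied."""
    return sum(satisfied_flags) / len(satisfied_flags)

def success(satisfied_flags):
    """An episode succeeds only when every predicate is satisfied."""
    return all(satisfied_flags)

flags = [True, True, True, False]  # e.g., one spatial predicate unmet
print(percent_complete(flags))  # 0.75
print(success(flags))           # False
```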
Planner Baseline Table (Chang et al., 2024)
| Method | Steps | Success | PC | Cycles |
|---|---|---|---|---|
| Heuristic-Expert | 1261 | 0.84 | 0.94 | – |
| ReAct Centr., full | 1347(34) | 0.74 | 0.88 | 17.5(.3) |
| ReAct Decent., part | 3295(76) | 0.73 | 0.86 | 15.2(.3) |
| ReAct-RAG | 3467(82) | 0.71 | 0.84 | 14.8(.3) |
| Finetuned (8B) | 3229(75) | 0.70 | 0.84 | 12.9(.2) |
| ReAct (learned) | 6495(182) | 0.57 | 0.76 | 22.7(.6) |
| ReAct (CG) | 12491(209) | 0.30 | 0.56 | 23.8(.5) |
5. Empirical Findings and Analysis
- Interactive learning improves generalization: Online ambiguity querying and correction yield substantial gains in out-of-distribution (unseen colors) success, and higher sample efficiency, often matching baselines with fewer demonstrations (Luijkx et al., 2022).
- Robustness to noise: Realistic Gaussian annotation noise boosts both baseline and interactive methods, but interactive learning maintains a larger performance gap over the baseline.
- Human–LLM collaboration: Human–LLM teams do not yet match human pair efficiency in multi-agent planning (humans require fewer steps), but fine-tuned LLMs approach solo human baselines; an 8B model, fine-tuned, matches 70B performance at significant inference speedup (Chang et al., 2024).
- Superiority of reasoning models: Chain-of-thought and reasoning-centric models (e.g., OpenAI o3-mini) outperform standard LLMs (e.g., GPT-4o, Llama 3) across centralized and decentralized planning and both full and partial observability regimes. The margin is most pronounced for planning correctness (percent-complete, success rate) but comes with increased computational cost per decision (Li et al., 8 Jul 2025).
- Failure modes: Common errors include order-of-operations, spatial localization, custom-syntax mishandling, coordination breakdown in decentralized settings, and premature task termination. Reasoning models recover more robustly from such errors.
6. Systemic Insights and Extensions
- Technical lessons: Transparent serialization, stateless master–slave scheduling, and fine-grained task granularity yield strong scaling and flexibility for parallel architectures, as in the original risk-management testbed (Chancelier et al., 2010).
- Bottlenecks: Communication overhead, single-master scheduling, and suboptimal load balancing limit performance at scale. Remedies include task bundling, hierarchical scheduling, and asynchronous execution.
- Research challenges: Core open problems include multi-agent coordination, error detection/recovery, robust spatial/temporal grounding, integration of perception and planning failures, and efficient human–robot interaction protocols (Chang et al., 2024).
- Future directions: Integration with richer multimodal models, curriculum learning, human-correction phases, continuous control, and expanded task complexity are active priorities (Li et al., 8 Jul 2025).
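The task-bundling remedy mentioned above can be sketched simply: grouping fine-grained tasks so the master sends one message per bundle rather than one per task. This is an illustrative fragment, not the risk-management testbed's implementation; the bundle size would be tuned against the load-balance loss it introduces.

```python
def make_bundles(tasks, bundle_size):
    """Group fine-grained tasks into fixed-size bundles so the master
    dispatches one message per bundle instead of one per task,
    amortizing communication overhead."""
    return [tasks[i:i + bundle_size] for i in range(0, len(tasks), bundle_size)]

tasks = list(range(10))
bundles = make_bundles(tasks, 4)
print(len(bundles))  # 3 messages instead of 10
print(bundles[-1])   # [8, 9]
```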
7. Impact and Role in Advancing Embodied and Parallel AI
The PARTNR benchmark family provides rigorously specified, reproducible platforms for measuring progress in ambiguity-resolving policy learning, collaborative embodied reasoning, and massively parallel computational finance. These benchmarks serve as reference points for evaluating interactive, sample-efficient learning, natural-language grounding, embodied reasoning, and efficient real-world deployment in both robotics and algorithmic systems (Luijkx et al., 2022, Chang et al., 2024, Chancelier et al., 2010, Li et al., 8 Jul 2025).
By concretely quantifying the limitations of state-of-the-art LLMs and embodied agents, these benchmarks highlight the gap between NLP-trained models and robust, error-tolerant embodied intelligence and offer clear guidelines for methodological and architectural improvements in interactive, data-driven, and scalable AI research.