PARTNR Benchmark Suite
- PARTNR Benchmark is a suite of evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning.
- It encompasses tasks ranging from simulated pick-and-place experiments to large-scale human–robot household collaborations, employing metrics such as success rate and sample efficiency.
- The benchmarks provide practical insights into online interactive learning, robust ambiguity detection, and multi-agent planning, setting clear directions for advancing AI and robotics.
The PARTNR benchmark is a suite of rigorous, publicly documented evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning across robotics, AI, and computational finance. Multiple distinct research efforts have adopted the PARTNR acronym—ranging from tabletop pick-and-place learning (Luijkx et al., 2022), to large-scale human–robot household collaboration (Chang et al., 2024, Li et al., 8 Jul 2025), to risk-management parallel computation (Chancelier et al., 2010)—each with bespoke tasks, formalisms, and experimental protocols. Below, the principal PARTNR benchmarks are detailed along axes of problem definition, experimental setup, evaluation metrics, algorithmic baselines, empirical findings, and open research challenges.
1. Definitions and Problem Formulations
Interactive Pick-and-Place Ambiguity Benchmark (Luijkx et al., 2022)
The PARTNR (Pick and place Ambiguity Resolving by Trustworthy iNteractive leaRning) benchmark is instantiated as a simulated tabletop manipulation task: three colored cubic blocks and three colored cylindrical bowls are placed at random, non-overlapping positions on a single table. The agent is presented, per episode, with three language-based commands of the canonical form: “Pick the [pick_color] block and place it in the [place_color] bowl.” Colors are drawn from both in-distribution (seen) and out-of-distribution (unseen) sets to induce domain shift.
The agent must select discrete pixel coordinates for the pick and place locations in sequence, conditioned on top-down RGB imagery and the command (embedded via a CLIP encoder). The core challenge is to resolve ambiguity in under-specified visual-linguistic contexts, proactively query for demonstrations, and adapt the pick-and-place policy via interactive learning.
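The action interface can be sketched as selecting the argmax pixel of each predicted Q-value heatmap. This is a minimal illustration, not the paper's implementation; the heatmaps are assumed to come from upstream CNNs conditioned on the RGB image and the CLIP-embedded command, and all names here are illustrative.

```python
import numpy as np

def select_pick_place(q_pick, q_place):
    """Pick the argmax pixel of each Q-value heatmap.

    q_pick, q_place: (H, W) arrays of per-pixel Q-values, assumed to be
    produced upstream by networks conditioned on the top-down RGB image
    and the CLIP-embedded command.
    """
    pick_yx = np.unravel_index(int(np.argmax(q_pick)), q_pick.shape)
    place_yx = np.unravel_index(int(np.argmax(q_place)), q_place.shape)
    return tuple(map(int, pick_yx)), tuple(map(int, place_yx))

# Toy heatmaps with a single peak each.
q_pick = np.zeros((4, 5)); q_pick[2, 3] = 1.0
q_place = np.zeros((4, 5)); q_place[0, 1] = 1.0
print(select_pick_place(q_pick, q_place))  # ((2, 3), (0, 1))
```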
Human–Robot Collaboration and Planning Benchmark (Chang et al., 2024, Li et al., 8 Jul 2025)
The PARTNR benchmark for planning and reasoning in embodied multi-agent tasks targets human–robot teamwork in household environments. An episode is specified by an initial Habitat 3.0 simulator state, a free-form natural-language task instruction (generated by LLMs with in-prompt retrieval), and an evaluation function. Task types include:
- Constraint-free (C) — unconstrained ordering and execution
- Spatial (S) — spatial relations (“next to”, “on top of”)
- Temporal (T) — partial-order constraints
- Heterogeneous (H) — agent-specific abilities (“wash”, “pour”)
Each instruction is decomposed into predicates, with temporal dependencies among them encoded as a directed acyclic graph (DAG). The environment is formalized as a two-agent POMDP over the joint state space of object, agent, and room configurations, with both full and partial observability regimes.
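The DAG of temporal dependencies can be made concrete with a small sketch: given hypothetical predicates and their prerequisites, a candidate execution order is valid only if every predicate appears after all of its prerequisites. The predicate names here are illustrative, not drawn from the benchmark.

```python
# Hypothetical predicate DAG: each predicate maps to the set of
# predicates that must be satisfied before it.
deps = {
    "in(cup, sink)": set(),                # no prerequisites
    "washed(cup)": {"in(cup, sink)"},      # wash only after placing in sink
    "on(cup, shelf)": {"washed(cup)"},     # shelve only once washed
}

def order_is_valid(order, deps):
    """Check that an execution order respects the partial-order DAG."""
    pos = {p: i for i, p in enumerate(order)}
    return all(pos[pre] < pos[p] for p, pres in deps.items() for pre in pres)

print(order_is_valid(["in(cup, sink)", "washed(cup)", "on(cup, shelf)"], deps))  # True
print(order_is_valid(["washed(cup)", "in(cup, sink)", "on(cup, shelf)"], deps))  # False
```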
2. Dataset Construction and Simulation Workflows
Interactive Pick-and-Place (Luijkx et al., 2022)
- Offline Data: Heuristic “expert” demonstrations, 500/1000/1500 trajectories.
- Online Data: Additional demonstrations are requested on-the-fly when the policy is ambiguous (detection based on topological analysis of predicted Q heatmaps).
- Splits: Examples include balanced 50% offline + 50% online, and other mixed-source splits to match the total data budget.
- Modalities: RGB image, language command (CLIP embedding), and ground-truth labels.
- Domain shifts: Evaluated under both color generalization (seen vs. unseen sets) and noisy demonstration conditions (Gaussian pixel-level label noise).
Multi-Agent Planning (Chang et al., 2024, Li et al., 8 Jul 2025)
- Scale: 100,000 training episodes and 1,000 each for validation and test; 60 houses from HSSD; 5,819 object classes; tasks automatically generated and curated via LLMs with simulation-in-the-loop filtering.
- Validation: All episodes undergo human-in-the-loop feasibility checks using tools like PrediViz.
Dataset Statistics Table
| Split | Houses | Episodes | Object Classes |
|---|---|---|---|
| Train | 37 | 100,000 | 5,819 |
| Val | 13 | 1,000 | 5,819 |
| Test | 10 | 1,000 | 5,819 |
3. Core Algorithms and Technical Constructs
Ambiguity Detection and Interactive Learning (Luijkx et al., 2022)
Given an observation, separate CNNs produce Q-value heatmaps for the pick and place actions. Local maxima (modes) in these maps are extracted via persistent homology. A softmax over the modes' raw scores yields per-mode confidences; ambiguity is flagged and a demonstration is requested when the highest mode's normalized confidence falls below a learned threshold.
A sensitivity-controlled gating function adapts this threshold from the empirical true-positive and false-negative rates, computed over a sliding window, so that the query rate tracks a desired sensitivity at a fixed learning rate.
All new samples are aggregated and used to update the policy (DAgger-style).
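The gating logic above can be sketched in a few lines. This is a hedged illustration under simplifying assumptions: the mode scores are taken as precomputed (standing in for the persistent-homology extraction), and the threshold update is a plain proportional rule toward the target sensitivity, not necessarily the paper's exact formula.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of raw scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def is_ambiguous(mode_scores, threshold):
    """Flag ambiguity when the best mode's softmax confidence falls
    below the current threshold. mode_scores are raw scores of local
    maxima of a Q heatmap (assumed precomputed here)."""
    return max(softmax(mode_scores)) < threshold

def update_threshold(threshold, sensitivity_est, target, lr=0.05):
    """Nudge the gating threshold toward a target sensitivity estimated
    over a sliding window of past queries (illustrative update rule)."""
    return threshold + lr * (target - sensitivity_est)

print(is_ambiguous([5.0, 1.0, 0.5], threshold=0.6))  # one dominant mode: False
print(is_ambiguous([2.0, 1.9, 1.8], threshold=0.6))  # near-tie: True
```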
Human–Robot Multi-Agent Planning (Chang et al., 2024, Li et al., 8 Jul 2025)
- POMDP formalization with agent-specific action sets (e.g., Spot’s abstract navigation and manipulation; humanoid-exclusive “Clean”, “Pour”).
- Tasks parameterized by partial/full observability; joint vs decentralized planners.
- Planning algorithms include centralized and decentralized ReAct LLM-based planners with fine-tuning (LoRA) and retrieval-augmented prompting.
- Evaluation functions recognize compositional, spatial, and temporal goal satisfaction via satisfaction of predicates.
4. Evaluation Protocols and Metrics
Pick-and-Place Benchmark (Luijkx et al., 2022)
- Success rate: correctly executed commands / total commands (300 commands per evaluation run).
- Query statistics: TP, FP, FN, TN—used to compute sensitivity and specificity.
- Domain adaptation speed: Success improvement as online demonstrations are added.
- Sample efficiency: Success as function of demonstration count (500/1000/1500).
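The metrics above reduce to simple ratios over counts; a minimal sketch, with the query classifier's TP/FP/FN/TN counts treated as given:

```python
def query_metrics(tp, fp, fn, tn):
    """Sensitivity and specificity of the ambiguity-query classifier:
    a true positive is a demonstration request made when the policy was
    genuinely ambiguous."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity

def success_rate(correct, total=300):
    """Fraction of correctly executed commands per evaluation batch."""
    return correct / total

print(query_metrics(40, 10, 10, 240))  # (0.8, 0.96)
print(success_rate(172))               # ~0.573
```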
Key Results Table (Success %)
| Algorithm | Data Split | 500 | 1000 | 1000 Noisy | 1500 |
|---|---|---|---|---|---|
| Baseline | 100% off, seen | 28.3 | 51.7 | 82.7 | 62.7 |
| PARTNR | 50% off + 50% int | 30.3 | 57.3 | 91.0 | 80.3 |
| PARTNR 80% | 50% off + 30% int | 28.0 | 39.3 | 77.7 | 68.0 |
| Baseline | 100% off, unseen | 19.0 | 22.0 | — | 16.7 |
| PARTNR | 50% off + 50% int | 30.7 | 53.0 | — | 78.3 |
Multi-Agent Planning Benchmark (Chang et al., 2024, Li et al., 8 Jul 2025)
- Percent-complete: fraction of an episode's task predicates satisfied at termination.
- Success: binary indicator that all predicates (including ordering constraints) are satisfied, i.e., percent-complete equals 1.
- Efficiency: Simulation steps per episode; LLM planning cycles; extraneous action count; task-offloading ratio.
- Human and human–AI baselines: Steps and completion percentages for solo, dual human, and human–LLM teams.
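Under the reading that percent-complete is the fraction of satisfied predicates and success requires all of them, the two headline metrics reduce to:

```python
def percent_complete(satisfied_flags):
    """Fraction of task predicates the evaluation function marks satisfied."""
    return sum(satisfied_flags) / len(satisfied_flags)

def success(satisfied_flags):
    """An episode succeeds only when every predicate is satisfied."""
    return all(satisfied_flags)

flags = [True, True, True, False]  # e.g., one spatial predicate unmet
print(percent_complete(flags))  # 0.75
print(success(flags))           # False
```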
Planner Baseline Table (Chang et al., 2024)
| Method | Steps | Success | PC | Cycles |
|---|---|---|---|---|
| Heuristic-Expert | 1261 | 0.84 | 0.94 | – |
| ReAct Centr., full | 1347(34) | 0.74 | 0.88 | 17.5(.3) |
| ReAct Decent., part | 3295(76) | 0.73 | 0.86 | 15.2(.3) |
| ReAct-RAG | 3467(82) | 0.71 | 0.84 | 14.8(.3) |
| Finetuned (8B) | 3229(75) | 0.70 | 0.84 | 12.9(.2) |
| ReAct (learned) | 6495(182) | 0.57 | 0.76 | 22.7(.6) |
| ReAct (CG) | 12491(209) | 0.30 | 0.56 | 23.8(.5) |
5. Empirical Findings and Analysis
- Interactive learning improves generalization: Online ambiguity querying and correction yield substantial gains in out-of-distribution (unseen colors) success, and higher sample efficiency, often matching baselines with fewer demonstrations (Luijkx et al., 2022).
- Robustness to noise: Realistic Gaussian annotation noise boosts both baseline and interactive methods, but interactive learning maintains a larger performance gap over the baseline.
- Human–LLM collaboration: Human–LLM teams do not yet match human pair efficiency in multi-agent planning (humans require fewer steps), but fine-tuned LLMs approach solo human baselines; an 8B model, fine-tuned, matches 70B performance at significant inference speedup (Chang et al., 2024).
- Superiority of reasoning models: Chain-of-thought and reasoning-centric models (e.g., OpenAI o3-mini) outperform standard LLMs (e.g., GPT-4o, Llama 3) across centralized and decentralized planning and both full and partial observability regimes. The margin is most pronounced for planning correctness (percent-complete, success rate) but comes with increased computational cost per decision (Li et al., 8 Jul 2025).
- Failure modes: Common errors include order-of-operations, spatial localization, custom-syntax mishandling, coordination breakdown in decentralized settings, and premature task termination. Reasoning models recover more robustly from such errors.
6. Systemic Insights and Extensions
- Technical lessons: Transparent serialization, stateless master–slave scheduling, and fine-grained task granularity yield strong scaling and flexibility for parallel architectures, as in the original risk-management testbed (Chancelier et al., 2010).
- Bottlenecks: Communication overhead, single-master scheduling, and suboptimal load balancing limit performance at scale. Remedies include task bundling, hierarchical scheduling, and asynchronous execution.
- Research challenges: Core open problems include multi-agent coordination, error detection/recovery, robust spatial/temporal grounding, integration of perception and planning failures, and efficient human–robot interaction protocols (Chang et al., 2024).
- Future directions: Integration with richer multimodal models, curriculum learning, human-correction phases, continuous control, and expanded task complexity are active priorities (Li et al., 8 Jul 2025).
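The task-bundling remedy mentioned above can be sketched simply: grouping fine-grained tasks so the master sends one message per bundle rather than one per task. This is an illustrative fragment, not the risk-management testbed's implementation; the bundle size would be tuned against the load-balance loss it introduces.

```python
def make_bundles(tasks, bundle_size):
    """Group fine-grained tasks into fixed-size bundles so the master
    dispatches one message per bundle instead of one per task,
    amortizing communication overhead."""
    return [tasks[i:i + bundle_size] for i in range(0, len(tasks), bundle_size)]

tasks = list(range(10))
bundles = make_bundles(tasks, 4)
print(len(bundles))  # 3 messages instead of 10
print(bundles[-1])   # [8, 9]
```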
7. Impact and Role in Advancing Embodied and Parallel AI
The PARTNR benchmark family provides rigorously specified, reproducible platforms for measuring progress in ambiguity-resolving policy learning, collaborative embodied reasoning, and massively parallel computational finance. These benchmarks serve as reference points for evaluating interactive, sample-efficient learning, natural-language grounding, embodied reasoning, and efficient real-world deployment in both robotics and algorithmic systems (Luijkx et al., 2022, Chang et al., 2024, Chancelier et al., 2010, Li et al., 8 Jul 2025).
By concretely quantifying the limitations of state-of-the-art LLMs and embodied agents, these benchmarks highlight the gap between NLP-trained models and robust, error-tolerant embodied intelligence and offer clear guidelines for methodological and architectural improvements in interactive, data-driven, and scalable AI research.