PARTNR Benchmark Suite

Updated 16 February 2026
  • PARTNR Benchmark is a suite of evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning.
  • It encompasses tasks ranging from simulated pick-and-place experiments to large-scale human–robot household collaborations, employing metrics such as success rate and sample efficiency.
  • The benchmarks provide practical insights into online interactive learning, robust ambiguity detection, and multi-agent planning, setting clear directions for advancing AI and robotics.

The PARTNR benchmark is a suite of rigorous, publicly documented evaluation frameworks designed for embodied planning, multi-agent collaboration, algorithmic risk management, and ambiguity-resolving interactive learning across robotics, AI, and computational finance. Multiple distinct research efforts have adopted the PARTNR acronym, ranging from tabletop pick-and-place learning (Luijkx et al., 2022) to large-scale human–robot household collaboration (Chang et al., 2024; Li et al., 8 Jul 2025) to risk-management parallel computation (Chancelier et al., 2010), each with bespoke tasks, formalisms, and experimental protocols. Below, the principal PARTNR benchmarks are detailed along axes of problem definition, experimental setup, evaluation metrics, algorithmic baselines, empirical findings, and open research challenges.

1. Definitions and Problem Formulations

The PARTNR (Pick and place Ambiguity Resolving by Trustworthy iNteractive leaRning) benchmark is instantiated as a simulated tabletop manipulation task: three colored cubic blocks and three colored cylindrical bowls are placed at random, non-overlapping positions on a single table. The agent is presented, per episode, with three language-based commands of the canonical form: “Pick the [pick_color] block and place it in the [place_color] bowl.” Colors are drawn from both in-distribution (seen) and out-of-distribution (unseen) sets to induce domain shift.

The agent must select discrete pixel coordinates (u, v) for the pick and place locations in sequence, conditioned on top-down RGB imagery and the command (embedded via a CLIP encoder). The core challenge is to resolve ambiguity in under-specified visual-linguistic contexts, proactively query for demonstrations, and adapt the pick-and-place policy via interactive learning.
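As a concrete illustration, the action interface reduces to choosing the argmax pixel of a predicted heatmap. The sketch below is hypothetical (the helper `select_pixel` and the list-based heatmap are illustrative, not from the benchmark code):

```python
# Hypothetical sketch of the action interface: pick the highest-scoring
# pixel (u, v) from an H x W Q-value heatmap (list-of-lists for simplicity;
# v indexes rows, u indexes columns).
def select_pixel(q_heatmap: list[list[float]]) -> tuple[int, int]:
    v, u = max(
        ((row, col) for row in range(len(q_heatmap))
         for col in range(len(q_heatmap[0]))),
        key=lambda rc: q_heatmap[rc[0]][rc[1]],
    )
    return u, v

q = [[0.0] * 4 for _ in range(4)]
q[2][3] = 1.0               # maximum at row 2, column 3
u, v = select_pixel(q)      # u = 3, v = 2
```

In the benchmark itself this selection is made twice per command, once from the pick heatmap and once from the place heatmap.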

The PARTNR benchmark for planning and reasoning in embodied multi-agent tasks targets human–robot teamwork in household environments. An episode E = (S_0, τ, C) is specified by an initial Habitat 3.0 simulator state S_0, a free-form natural-language task τ (generated by LLMs using in-prompt retrieval), and an evaluation function C. Task types include:

  • Constraint-free (C) — unconstrained ordering and execution
  • Spatial (S) — spatial relations (“next to”, “on top of”)
  • Temporal (T) — partial-order constraints
  • Heterogeneous (H) — agent-specific abilities (“wash”, “pour”)

Each instruction is decomposed into N predicates φ_i, with temporal dependencies encoded as a DAG. The environment is formalized as a two-agent POMDP over the joint state space of object, agent, and room configurations, with both full and partial observability regimes.
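To make the DAG constraint concrete, here is a minimal, hypothetical check that a sequence of satisfied predicates respects its temporal dependencies (the function name and predicate strings are illustrative, not from the benchmark API):

```python
# Hypothetical check that a predicate-satisfaction order respects the
# temporal DAG: a predicate may only be satisfied after all of its parents.
def order_respects_dag(order: list[str], deps: dict[str, list[str]]) -> bool:
    seen: set[str] = set()
    for pred in order:
        if any(parent not in seen for parent in deps.get(pred, [])):
            return False
        seen.add(pred)
    return True

deps = {"place(cup, table)": ["wash(cup)"]}  # wash must precede placing
ok = order_respects_dag(["wash(cup)", "place(cup, table)"], deps)
bad = order_respects_dag(["place(cup, table)", "wash(cup)"], deps)
```

Constraint-free tasks correspond to an empty dependency map, while temporal (T) tasks populate it with partial-order edges.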

2. Dataset Construction and Simulation Workflows

  • Offline Data: Heuristic “expert” demonstrations, 500/1000/1500 trajectories.
  • Online Data: Additional demonstrations are requested on-the-fly when the policy is ambiguous (detection based on topological analysis of predicted Q heatmaps).
  • Splits: Examples include balanced 50% offline + 50% online, and other mixed-source splits to match the total data budget.
  • Modalities: RGB image, language command (CLIP embedding), and ground-truth (u, v) labels.
  • Domain shifts: Evaluated under both color generalization (seen vs unseen sets) and noisy demonstration conditions (Gaussian label noise, N(0, 3²) px).
  • Scale: 100,000 episodes (train), 1,000 each (val, test), 60 houses from HSSD, 5,819 object classes, with tasks automatically generated and curated via LLMs and simulation-in-the-loop filtering.
  • Validation: All episodes undergo human-in-the-loop feasibility checks using tools like PrediViz.
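A small, hypothetical helper illustrates how a fixed demonstration budget can be composed from offline and online sources, matching the 50/50 split described above (`compose_budget` is illustrative, not part of the benchmark tooling):

```python
# Hypothetical helper: split a fixed demonstration budget between offline
# ("expert") and online (ambiguity-triggered) sources, as in the balanced
# 50% offline + 50% online split described above.
def compose_budget(total: int, offline_frac: float) -> dict[str, int]:
    offline = int(total * offline_frac)
    return {"offline": offline, "online": total - offline}

split = compose_budget(1000, 0.5)   # {'offline': 500, 'online': 500}
```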

Dataset Statistics Table

Split   Houses   Episodes   Object Classes
Train   37       100,000    5,819
Val     13       1,000      5,819
Test    10       1,000      5,819

3. Core Algorithms and Technical Constructs

Given observation o_t = (image, command), Q-value heatmaps Q_pick and Q_place are computed using separate CNNs. Local maxima in these maps are extracted via persistent homology. For the k extracted maxima T_i = (u_i, v_i) with raw scores v_i, a softmax over the scores yields confidences p̂_i; ambiguity is flagged and a demonstration is requested if the highest mode's normalized confidence p̂_act is below a learned threshold p_act^thr.
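The confidence-gating step can be sketched without the persistent-homology machinery: given the raw scores of the extracted maxima, take a softmax and compare the top mode's confidence against the threshold (the function `ambiguity_query` is a hypothetical simplification):

```python
import math

# Hypothetical simplification of the ambiguity gate: softmax over the raw
# scores of the local maxima; request a demonstration when the top mode's
# normalized confidence p_act falls below the threshold p_thr.
def ambiguity_query(max_scores: list[float], p_thr: float) -> bool:
    shift = max(max_scores)                      # for numerical stability
    exps = [math.exp(s - shift) for s in max_scores]
    p_act = max(exps) / sum(exps)
    return p_act < p_thr                         # True -> query for a demo

confident = ambiguity_query([5.0, 0.1, 0.2], p_thr=0.5)  # one dominant mode
ambiguous = ambiguity_query([2.0, 1.9, 0.1], p_thr=0.5)  # two rival modes
```

A single dominant mode yields p_act near 1 and no query; two near-equal modes split the softmax mass and trigger a demonstration request.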

A sensitivity-controlled gating function adapts this threshold based on the empirical rates of true positives and false negatives (sliding window size w_n, desired sensitivity s_des, learning rate a):

p^thr = p_0^thr - a (s_des - ŝ)

All new samples are aggregated and used to update the policy (DAgger-style).
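The threshold update above can be sketched as a small stateful gate; the class below is a hypothetical implementation that estimates sensitivity over a sliding window of query outcomes and applies p^thr = p_0^thr - a (s_des - ŝ):

```python
from collections import deque

# Hypothetical sketch of the sensitivity-controlled gate: query outcomes are
# tracked in a sliding window of size w_n, sensitivity is estimated as
# s_hat = TP / (TP + FN), and the threshold follows
# p_thr = p0_thr - a * (s_des - s_hat).
class SensitivityGate:
    def __init__(self, p0_thr: float = 0.5, s_des: float = 0.9,
                 a: float = 0.05, w_n: int = 50):
        self.p0_thr, self.s_des, self.a = p0_thr, s_des, a
        self.window = deque(maxlen=w_n)  # (queried, demo_was_needed) pairs

    def record(self, queried: bool, was_needed: bool) -> None:
        self.window.append((queried, was_needed))

    def threshold(self) -> float:
        tp = sum(1 for q, n in self.window if q and n)
        fn = sum(1 for q, n in self.window if not q and n)
        s_hat = tp / (tp + fn) if (tp + fn) else self.s_des
        return self.p0_thr - self.a * (self.s_des - s_hat)

gate = SensitivityGate()
gate.record(queried=False, was_needed=True)  # missed query: false negative
gate.record(queried=True, was_needed=True)   # correct query: true positive
thr = gate.threshold()  # s_hat = 0.5, so thr = 0.5 - 0.05 * (0.9 - 0.5) = 0.48
```

The default values of p_0^thr, s_des, a, and w_n here are illustrative placeholders, not the benchmark's reported hyperparameters.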

  • POMDP formalization with agent-specific action sets (e.g., Spot’s abstract navigation and manipulation; humanoid-exclusive “Clean”, “Pour”).
  • Tasks parameterized by partial/full observability; joint vs decentralized planners.
  • Planning algorithms include centralized and decentralized ReAct LLM-based planners with fine-tuning (LoRA) and retrieval-augmented prompting.
  • Evaluation functions recognize compositional, spatial, and temporal goal satisfaction via satisfaction of predicates.

4. Evaluation Protocols and Metrics

  • Success rate: (Correctly executed commands) / (total commands), e.g., 300 per evaluation.
  • Query statistics: TP, FP, FN, TN—used to compute sensitivity and specificity.
  • Domain adaptation speed: Success improvement as online demonstrations are added.
  • Sample efficiency: Success as function of demonstration count (500/1000/1500).
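These metrics are straightforward ratios; the helpers and counts below are illustrative, not reported benchmark values:

```python
# Illustrative metric helpers with made-up counts; the query statistics
# (TP, FP, FN, TN) come from the ambiguity-detection decisions.
def success_rate(correct: int, total: int) -> float:
    return correct / total

def sensitivity(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    return tn / (tn + fp)

sr = success_rate(273, 300)        # e.g., 273 of 300 commands correct
sens = sensitivity(tp=40, fn=10)   # fraction of needed demos actually queried
spec = specificity(tn=200, fp=50)  # fraction of unneeded queries avoided
```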

Key Results Table (Success %)

Algorithm    Data Split          500    1000   1000 Noisy   1500
Baseline     100% off, seen      28.3   51.7   82.7         62.7
PARTNR       50% off + 50% int   30.3   57.3   91.0         80.3
PARTNR 80%   50% off + 30% int   28.0   39.3   77.7         68.0
Baseline     100% off, unseen    19.0   22.0   -            16.7
PARTNR       50% off + 50% int   30.7   53.0   -            78.3
  • Percent-complete: PC = (1/N) Σ_{i=1}^{N} 1[φ_i satisfied]
  • Success: S = 1[PC = 1]
  • Efficiency: Simulation steps per episode; LLM planning cycles; extraneous action count; task-offloading ratio.
  • Human and human–AI baselines: Steps and completion percentages for solo, dual human, and human–LLM teams.
Method               Steps         Success   PC     Cycles
Heuristic-Expert     1261          0.84      0.94   -
ReAct Centr., full   1347 (34)     0.74      0.88   17.5 (.3)
ReAct Decent., part  3295 (76)     0.73      0.86   15.2 (.3)
ReAct-RAG            3467 (82)     0.71      0.84   14.8 (.3)
Finetuned (8B)       3229 (75)     0.70      0.84   12.9 (.2)
ReAct (learned)      6495 (182)    0.57      0.76   22.7 (.6)
ReAct (CG)           12491 (209)   0.30      0.56   23.8 (.5)
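The percent-complete and success metrics defined above reduce to a few lines; the helpers below are illustrative:

```python
# Illustrative helpers for the PC and S metrics, given per-predicate
# satisfaction flags for a single episode.
def percent_complete(satisfied: list[bool]) -> float:
    return sum(satisfied) / len(satisfied)

def success(satisfied: list[bool]) -> int:
    return int(percent_complete(satisfied) == 1.0)

flags = [True, True, False, True]   # 3 of 4 predicates satisfied
pc = percent_complete(flags)        # 0.75
s = success(flags)                  # 0, since PC < 1
```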

5. Empirical Findings and Analysis

  • Interactive learning improves generalization: Online ambiguity querying and correction yield substantial gains in out-of-distribution (unseen colors) success, and higher sample efficiency, often matching baselines with fewer demonstrations (Luijkx et al., 2022).
  • Robustness to noise: Realistic Gaussian annotation noise boosts both baseline and interactive methods, but interactive learning maintains a larger performance gap over the baseline.
  • Human–LLM collaboration: Human–LLM teams do not yet match human pair efficiency in multi-agent planning (humans require fewer steps), but fine-tuned LLMs approach solo human baselines; an 8B model, fine-tuned, matches 70B performance at significant inference speedup (Chang et al., 2024).
  • Reasoning-model superiority: Chain-of-thought and reasoning-centric models (e.g., OpenAI o3-mini) outperform standard LLMs (e.g., GPT-4o, Llama 3) in both centralized/decentralized and full/partial observability regimes. The margin is more pronounced for planning correctness (percent-complete, success rate) but comes with increased computational cost per decision (Li et al., 8 Jul 2025).
  • Failure modes: Common errors include order-of-operations, spatial localization, custom-syntax mishandling, coordination breakdown in decentralized settings, and premature task termination. Reasoning models recover more robustly from such errors.

6. Systemic Insights and Extensions

  • Technical lessons: Transparent serialization, stateless master–slave scheduling, and fine-grained task granularity yield strong scaling and flexibility for parallel architectures, as in the original risk-management testbed (Chancelier et al., 2010).
  • Bottlenecks: Communication overhead, single-master scheduling, and suboptimal load balancing are limiting at scale. Remedies include task-bundling, hierarchical scheduling, and asynchronous execution.
  • Research challenges: Core open problems include multi-agent coordination, error detection/recovery, robust spatial/temporal grounding, integration of perception and planning failures, and efficient human–robot interaction protocols (Chang et al., 2024).
  • Future directions: Integration with richer multimodal models, curriculum learning, human-correction phases, continuous control, and expanded task complexity are active priorities (Li et al., 8 Jul 2025).

7. Impact and Role in Advancing Embodied and Parallel AI

The PARTNR benchmark family provides rigorously specified, reproducible platforms for measuring progress in ambiguity-resolving policy learning, collaborative embodied reasoning, and massively parallel computational finance. They serve as reference points for evaluating interactive, sample-efficient learning, natural language grounding, embodied reasoning, and efficient real-world deployment in both robotics and algorithmic systems (Luijkx et al., 2022, Chang et al., 2024, Chancelier et al., 2010, Li et al., 8 Jul 2025).

By concretely quantifying the limitations of state-of-the-art LLMs and embodied agents, these benchmarks highlight the gap between NLP-trained models and robust, error-tolerant embodied intelligence and offer clear guidelines for methodological and architectural improvements in interactive, data-driven, and scalable AI research.
