PHYRE-B Benchmark
- PHYRE-B is a controlled evaluation suite focused on physical reasoning, requiring agents to place a single ball in deterministic Newtonian scenes.
- It emphasizes rapid learning through the AUCCESS metric, challenging agents with within-template and cross-template generalization tasks.
- The benchmark drives research into model-based planning, online adaptation, and causal inference to improve sample-efficient and robust learning.
The PHYRE-B benchmark is a controlled evaluation suite developed to measure and advance the sample efficiency and generalization ability of artificial agents in physical reasoning tasks. It is the single-ball tier of the broader PHYRE suite: an agent modifies a two-dimensional deterministic Newtonian scene by placing a single ball, with the objective of solving classical-mechanics puzzles designed to probe human-like intuition and model building in physical systems.
1. Design Objectives and Structure
PHYRE, or "PHYsical REasoning," is motivated by the need for benchmarks that disentangle physical reasoning from unrelated perception and actuation challenges. The PHYRE-B tier introduces a continuous 3D action space (ball center coordinates and radius) and restricts interventions to a single ball per scene. The environment is governed by deterministic gravity, collision, and limited friction; distractors prevalent in realistic vision-based datasets are absent by design to focus agent evaluation on reasoning under physical constraints. Each task is the instantiation of a template (“template” denotes the parameterized puzzle family, e.g., spatial configuration, object dimensions), and is defined by an initial scene with static and dynamic objects (such as balls, bars, standing sticks, or jars) and a symbolic goal relation (e.g., causing two objects to remain in contact for at least three seconds).
The evaluation protocol distinguishes between within-template generalization (solving new scenes derived from familiar templates) and cross-template generalization (addressing novel templates unseen in training), explicitly testing an agent’s ability to extrapolate underlying physical principles rather than simply memorize strategies or exploit overfit heuristics.
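As a concrete illustration of the two generalization splits, the short snippet below retrieves the within-template and cross-template folds for the single-ball tier. It assumes the publicly released `phyre` Python package; the evaluation-setup names and exact signatures may differ between package versions.

```python
import phyre  # public PHYRE package; API details are version-dependent

# Within-template: test tasks are drawn from templates seen during training.
train_w, dev_w, test_w = phyre.get_fold('ball_within_template', 0)

# Cross-template: test tasks come from templates never seen during training.
train_c, dev_c, test_c = phyre.get_fold('ball_cross_template', 0)

# Task IDs pair a template ID with an instance ID (e.g., '00001:023').
print(len(train_w), len(test_w), len(train_c), len(test_c))
```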
2. Task Composition and Stability
Each PHYRE-B puzzle consists of:
- A pre-specified deterministic scene with objects configured to pose a novel interaction challenge.
- A specified goal: formalized as a triplet (subject, relation, object), with the relation typically being persistent contact for a minimal duration.
- One-shot intervention space: Placement of a single ball, with variable center and radius (continuous 3D), prior to simulation rollout.
Diversity within a template is achieved by varying parameters (object locations, scales, and rotations), ensuring individual puzzle instances do not share a trivial “master solution.” Furthermore, templates are engineered for “solution stability”: solution trajectories must remain valid under small perturbations (e.g., translation by 0.5 pixels), making robust, generalizable policies necessary for high performance.
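A minimal interaction sketch for the single-ball tier is shown below. It again assumes the `phyre` package and its documented simulator entry points (`initialize_simulator`, `build_discrete_action_space`, `simulate_action`), whose names and return types may vary slightly across releases.

```python
import phyre

train_ids, _, _ = phyre.get_fold('ball_within_template', 0)
simulator = phyre.initialize_simulator(train_ids, 'ball')  # single-ball action tier

# Actions live in [0, 1]^3: normalized (x, y) ball center plus normalized radius.
candidate_actions = simulator.build_discrete_action_space(max_actions=1000)

# One-shot intervention: place the ball, then roll the deterministic scene forward.
simulation = simulator.simulate_action(0, candidate_actions[0], need_images=False)
print(simulation.status.is_solved())  # True if the goal relation held long enough
```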
3. Evaluation Metric: Sample Efficiency via AUCCESS
A central innovation of PHYRE-B is its sample-efficiency evaluation, formalized by the Area Under the Success-percentage Curve (AUCCESS):

$$\text{AUCCESS} = \frac{\sum_{k=1}^{100} w_k \, s_k}{\sum_{k=1}^{100} w_k}, \qquad w_k = \log(k+1) - \log(k),$$

where $s_k$ is the percentage of tasks solved within $k$ attempts. The log-scale weights decay roughly as $1/k$, so the metric strongly privileges early solutions: an agent that needs more than 10 attempts on a task can obtain at most roughly 50% AUCCESS on that task, emphasizing the ability to learn quickly from a limited set of interactions. This operationalizes the benchmark's focus on sample-efficient learning agents.
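The metric can be computed directly from this definition; the sketch below is a plain-Python rendering, independent of the benchmark code, that returns the score in [0, 1] (the paper's percentage divided by 100).

```python
import math

def auccess(attempts_to_solve, max_attempts=100):
    """AUCCESS from the number of attempts each task needed.

    attempts_to_solve: one entry per task, the 1-indexed attempt at which the
    task was solved, or None if it was never solved within max_attempts.
    """
    ks = range(1, max_attempts + 1)
    weights = [math.log(k + 1) - math.log(k) for k in ks]  # w_k; sums to log(101)

    score = 0.0
    for k, w in zip(ks, weights):
        # s_k: fraction of tasks solved within k attempts.
        s_k = sum(1 for a in attempts_to_solve if a is not None and a <= k)
        score += w * s_k / len(attempts_to_solve)
    return score / sum(weights)

# Example: one task solved on the 1st attempt, one on the 20th, one unsolved.
print(round(auccess([1, 20, None]), 3))
```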
4. Baseline Agent Performance and Environmental Challenges
PHYRE-B exposes multiple challenges for contemporary learning algorithms:
- Extreme sample complexity: Baseline agents (including random [RAND], memory-based [MEM], and DQN variants) commonly require thousands to hundreds of thousands of simulation attempts per task. On PHYRE-B, a random agent may require around 10,000 attempts; the two-ball PHYRE-2B tasks are even more difficult for current models.
- Continuous action ranking: Effective exploration requires agents to rank and propose solutions in large, continuous action spaces; methods whose proposals lack diversity or cluster around a few similar candidates suffer significant performance drop-offs or plateaus (a simplified baseline ranking loop is sketched after this list).
- Limited use of feedback: Despite receiving detailed intermediate rollout states after each failed attempt, agents mostly operate in a contextual bandit regime, typically learning little from failed trajectories.
- Generalization gap: Performance in cross-template settings sharply lags within-template results, highlighting overfitting and weak transfer of physical knowledge.
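To make the ranking-and-retry regime concrete, the sketch below shows a generic baseline loop: sample a bank of candidate (x, y, radius) actions, rank them once with a learned scorer, and spend the per-task attempt budget on the top-ranked candidates. The `simulate` and `score_actions` callables are hypothetical placeholders, not the benchmark's API; only the overall loop structure mirrors the MEM/DQN-style baselines.

```python
import random

def solve_task(simulate, score_actions, budget=100, bank_size=10_000, seed=0):
    """Generic ranking agent for one single-ball task.

    simulate(action) -> bool        # hypothetical: True if the rollout solves the task
    score_actions(actions) -> list  # hypothetical: learned scores, higher = better
    """
    rng = random.Random(seed)
    # Candidate bank: normalized (x, y, radius) triples in [0, 1]^3.
    bank = [(rng.random(), rng.random(), rng.random()) for _ in range(bank_size)]

    # Rank the whole bank once; a purely contextual-bandit agent never revises
    # these scores after failed attempts.
    ranked = sorted(zip(score_actions(bank), bank), reverse=True)

    for attempt, (_, action) in enumerate(ranked[:budget], start=1):
        if simulate(action):
            return attempt  # attempts used; this feeds directly into AUCCESS
    return None  # unsolved within the budget
```

With `score_actions` returning uniform or random scores, this loop degenerates to the RAND baseline; the memory-based and DQN baselines differ mainly in how the action scores are obtained from offline training.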
5. Algorithmic Advances and Directions
The PHYRE-B benchmark highlights several promising directions:
- Online adaptation: Online-updating agents (e.g., MEM-O, DQN-O), which adapt their action ranking based on intermediate feedback, outperform static policies in the cross-template regime, provided the aggressiveness of the updates is tuned to avoid overfitting to unreliable test-time signals (a simplified update of this kind is sketched after this list).
- Forward prediction and counterfactual reasoning: Agents equipped with learnable forward-dynamics models—capable of predicting consequences of candidate actions without repeatedly querying the simulator—enable more efficient, model-based search strategies. The integration of counterfactual reasoning (forecasting the probable outcomes of alternate interventions) is seen as critical for surpassing current baselines.
- Action space diversity: Explicit encouragement of diverse action proposals, either through sampling or ranking mechanisms, mitigates over-exploitation of narrow action clusters, ensuring better coverage and discovery in the continuous action domain.
- Causal inference and invariance: Leveraging methods from causal inference (e.g., Invariant Causal Prediction) may allow agents to identify and exploit the fundamental Newtonian invariances underpinning all task templates, increasing generalization and robustness.
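The sketch below illustrates one way to combine online re-ranking with action-space diversity: after every failed attempt, candidates close to the failed action in the normalized (x, y, radius) space are penalized, pushing subsequent attempts toward unexplored regions. It is a simplified illustration using the same hypothetical `simulate` and `score_actions` interfaces as the previous sketch, not the exact MEM-O or DQN-O update rule.

```python
import math
import random

def solve_task_online(simulate, score_actions, budget=100, bank_size=10_000,
                      penalty=1.0, radius=0.1, seed=0):
    """Ranking agent that re-ranks the candidate bank after every failed attempt."""
    rng = random.Random(seed)
    bank = [(rng.random(), rng.random(), rng.random()) for _ in range(bank_size)]
    scores = list(score_actions(bank))

    for attempt in range(1, budget + 1):
        best = max(range(len(bank)), key=lambda i: scores[i])
        action = bank[best]
        if simulate(action):
            return attempt
        # Online update: down-weight candidates within `radius` of the failed
        # action, so the next attempt is drawn from a different region.
        for i, cand in enumerate(bank):
            dist = math.dist(cand, action)
            if dist < radius:
                scores[i] -= penalty * (1.0 - dist / radius)
    return None
```

Here the penalty strength plays the role of the "update aggressiveness" noted above: too small and the agent keeps retrying near-duplicates; too large and it discards informative regions on the basis of a single noisy failure.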
6. Impact, Extensions, and Research Implications
PHYRE-B catalyzes research in several key areas:
- Sample-efficient model-based learning: The emphasis on rapid learning with minimal interaction is directly relevant for autonomous experimentation, scientific discovery, and robotics, where rollouts are expensive or constrained.
- Benchmark-driven progress: The standardization of puzzles, metric (AUCCESS), and generalization splits provides a consistent yardstick for measuring advances in sample efficiency, model-based planning, and robust reasoning.
- Pedagogical value: Task generation and solution-stability criteria are designed to drive the development of agents that do not merely optimize for narrow policies but instead internalize core physical principles.
- Interactions with other benchmarks: PHYRE-B is complementary to other physical simulation benchmarks, such as those emphasizing high-dimensional state prediction or full-scene rendering, but is uniquely geared towards single-shot, intuition-like reasoning and efficient learning from limited attempts.
7. Future Directions
Potential future research inspired by PHYRE-B includes:
- Incorporation of richer object affordances and partial observability to more closely parallel real-world reasoning.
- Scaling forward-prediction and counterfactual modules to more complex or higher-dimensional problems.
- Integration with causal discovery or hybrid neuro-symbolic reasoning frameworks to further improve transfer and efficiency.
- Extending the benchmark suite to encompass interactive and multi-step planning paradigms (as in I-PHYRE), to bridge the gap between one-shot interventions and sustained, temporally coordinated interaction.
In summary, PHYRE-B represents a rigorously constructed benchmark that foregrounds sample efficiency, structured generalization, and robustness in physical reasoning. By systematically exposing the deficiencies of prevailing agent architectures and proposing clear directions for improvement, it serves as a cornerstone for research into agents capable of human-like, adaptive, and data-efficient physical problem solving (Bakhtin et al., 2019).