MaD Physics: Evaluating information seeking under constraints in physical environments

Published 11 May 2026 in cs.AI and cs.LG | (2605.10820v1)

Abstract: Scientific discovery is fundamentally a resource-constrained process that requires navigating complex trade-offs between the quality and quantity of measurements due to physical and cost constraints. Measurements drive the scientific process by revealing novel phenomena to improve our understanding. Existing benchmarks for evaluating agents for scientific discovery focus on either static knowledge-based reasoning or unconstrained experimental design tasks, and do not capture the ability to make measurements and plan under constraints. To bridge this gap, we propose Measuring and Discovering Physics (MaD Physics), a benchmark to evaluate the ability of agents to make informative measurements and conclusions subject to constraints on the quality and quantity of measurements. The benchmark consists of three environments, each based on a distinct physical law. To mitigate contamination from existing knowledge, MaD Physics includes altered physical laws. In each trial, the agent makes measurements of the system until it exhausts an allotted budget and then the agent has to infer the underlying physical law to make predictions about the state of the system in the future. MaD Physics evaluates two fundamental capabilities of scientific agents: inferring models from data and planning under constraints. We also demonstrate how MaD Physics can be used to evaluate other capabilities such as multimodality and in-context learning. We benchmark agents on MaD Physics using four Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash), identifying shortcomings in their structured exploration and data collection capabilities and highlighting directions to improve their scientific reasoning.


Authors (7)

Summary

The paper introduces MaD Physics, a benchmark that forces AI agents to balance measurement quality, quantity, and cost in modified physical domains.
It employs diverse simulation environments in classical, fluid, and quantum mechanics to rigorously assess model-based reasoning and experimental planning.
Empirical results reveal that even state-of-the-art LLMs struggle with symbolic law induction and optimal experimental design under resource limitations.

MaD Physics: Formal Evaluation of Information-Seeking AI under Physical Constraints

Motivation and Benchmark Design

MaD Physics is a benchmark for evaluating AI agents' information-seeking behavior within resource-constrained physical environments. Unlike prior benchmarks, which are either unconstrained (allowing unlimited interventions) or biased toward static knowledge retrieval, MaD Physics forces agents to navigate trade-offs between measurement quality, quantity, and cost, mimicking real scientific practice.

Environments in MaD Physics are parameterized around three distinct physical domains (Classical, Fluid, Quantum), each instantiated with altered laws to prevent agents from exploiting memorized priors. The core agent loop has two phases. First, the agent selects measurements (time, target, precision), each of which incurs an explicit cost, until its budget is exhausted. Second, it answers prediction queries that require inferring future system states. This setup directly tests model-based reasoning and planning under budget constraints.
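The two-phase loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the inverse-square cost curve `cost`, the toy dynamics `true_state`, and the names `Measurement` and `run_trial` are all hypothetical, since the source does not give the cost model or an API.

```python
import random
from dataclasses import dataclass


@dataclass
class Measurement:
    t: float       # time at which the system was observed
    target: int    # index of the observed particle
    sigma: float   # requested noise standard deviation
    value: float   # noisy reading returned by the environment


def cost(sigma: float) -> float:
    """Hypothetical cost curve (assumption): halving the noise quadruples the price."""
    return 1.0 / sigma**2


def true_state(t: float, target: int) -> float:
    """Toy stand-in for the hidden simulator; not one of the MaD environments."""
    return (target + 1) * t - 0.5 * t**2


def run_trial(budget: float, plan, rng: random.Random):
    """Execute planned measurements in order until the next one is unaffordable."""
    spent, data = 0.0, []
    for t, target, sigma in plan:
        price = cost(sigma)
        if spent + price > budget:
            break
        spent += price
        reading = true_state(t, target) + rng.gauss(0.0, sigma)
        data.append(Measurement(t, target, sigma, reading))
    return data, spent


# Ten candidate measurements of particle 0, each costing 1 / 0.5**2 = 4.0 units.
plan = [(0.5 * k, 0, 0.5) for k in range(1, 11)]
data, spent = run_trial(budget=20.0, plan=plan, rng=random.Random(0))
print(len(data), spent)  # five measurements fit in the budget
```

In the benchmark the plan would be chosen adaptively by the agent rather than fixed in advance. The fixed list here only keeps the sketch short.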

Environment: Structural and Operational Details

MaD Physics features:

Classical Mechanics: N-body systems in 2D/3D evolve under Newtonian laws, but with significant alterations such as anisotropic inertial mass tensors and nonstandard gravity (e.g., $1/r$ and “ripple” laws). Observations can be made on individual particles, subject to a precision-cost trade-off.
Fluid Mechanics: 2D incompressible flow simulated on spectral grids, embedded with "alien" gyroscopic forces parameterized by kinetic/vorticity modulations, deviating from standard Navier-Stokes. Measurements target vorticity at selected spatial coordinates.
Quantum Mechanics: Two-particle systems with nontrivial entanglement initialization and generalized measurement norms ($L_p$-Born rule). Each measurement collapses the wavefunction; repeated trials evaluate information extraction despite destructive observation.

Each environment is designed for tractable simulation but leverages high-dimensional state and nontrivial alteration parameters. The agent receives no structural model specifications and is expected to infer dynamics ab initio, maximizing expected information gain per resource spent.
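The summary does not spell out the form of the $L_p$-Born rule. One natural reading, assumed here, is that outcome probabilities are proportional to $|\psi_i|^p$ and then normalized, so that $p = 2$ recovers the standard Born rule. The function name `lp_born_probabilities` and the two-outcome state are hypothetical and serve only as illustration.

```python
def lp_born_probabilities(amplitudes, p=2.0):
    """Outcome probabilities proportional to |amplitude|**p (assumed form of the rule)."""
    weights = [abs(a) ** p for a in amplitudes]
    total = sum(weights)
    return [w / total for w in weights]


# A toy two-outcome state with amplitudes 0.6 and 0.8i.
psi = [0.6, 0.8j]

standard = lp_born_probabilities(psi, p=2.0)  # standard Born rule: about [0.36, 0.64]
altered = lp_born_probabilities(psi, p=1.0)   # altered rule: about [0.43, 0.57]
print(standard, altered)
```

Under this reading, an agent that assumes $p = 2$ would systematically mispredict outcome frequencies whenever the environment uses a different exponent.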

Empirical Evaluation: Gemini Model Analysis

The benchmark evaluation covered four Gemini models: 2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash. Agents operated in minimal scaffolds with code execution and two prompting strategies: a base empirical scientist prompt and a Bayesian strategic prompt emphasizing experimental design.

Quantitative Performance

Prediction error: Error is reported as normalized RMSE for the classical domain and as L2 error for the fluid and quantum domains. Gemini 3 Flash exhibited lower prediction error than the Gemini 2.5 models across environments, but still struggled on altered laws.
Prompt engineering: The Bayesian-inspired strategy prompt yielded improved inference performance, especially in altered environments, indicating sensitivity to explicit reasoning scaffolds.
Model progression: Performance generally increased with model capability (Flash Lite < Flash < Pro < Gemini 3 Flash), but even Gemini 3 Flash failed to consistently recover correct symbolic laws and was prone to inaccurate extrapolation.
Visual tasks: When agents operated only on visual observations, error magnitudes increased sharply, demonstrating limitations in multimodal scientific reasoning.
In-context learning: Gemini models failed to exhibit robust improvement over multiple episodic trials unless reinforced via prompt strategies.

Strong numerical failures were documented: agents often produced out-of-domain runaway predictions when overloaded with altered law parameters, showing lack of robust internal model reconfiguration under novel regimes.
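The two error metrics named in this subsection can be written down concretely. The sketch below is illustrative only: the summary does not state how the RMSE is normalized, so dividing by the root-mean-square of the true values is an assumption, as are the function names `nrmse` and `l2_error`.

```python
import math


def nrmse(pred, true):
    """RMSE divided by the RMS of the true values (normalization is an assumption)."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)
    scale = math.sqrt(sum(t ** 2 for t in true) / len(true))
    return math.sqrt(mse) / scale


def l2_error(pred, true):
    """Euclidean norm of the difference between prediction and truth."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))


true = [3.0, 4.0]
good = [3.0, 5.0]      # one coordinate off by 1
runaway = [3.0, 1e6]   # an out-of-domain "runaway" prediction

print(nrmse(good, true), l2_error(good, true))
print(nrmse(runaway, true))
```

Because both metrics are unbounded, a single runaway prediction like the last one can dominate any average over trials. This is why the out-of-domain failures noted above matter when comparing models.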

Qualitative Trajectories

Agent reasoning revealed explicit structural hypothesis testing, iterative model refutation (e.g., systematically ruling out $1/r^2$ or $1/r$ laws for gravity), and fallback to empirically fit models (constant acceleration, local regression). Despite consistent Bayesian reasoning on measurement selection, agents could not reliably generalize symbolic model induction given sparse, noisy data.
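One simple way to discriminate between candidate force laws such as $1/r$ and $1/r^2$ is a log-log regression of acceleration magnitude against distance. The sketch below is a hypothetical illustration, not the agents' actual procedure. It assumes acceleration magnitudes have already been estimated from position measurements, and it uses synthetic data generated from a $1/r$ law with 2% multiplicative noise.

```python
import math
import random


def fit_power_law(r, a):
    """Least-squares fit of log a = log k - n log r; returns (n, k)."""
    x = [math.log(ri) for ri in r]
    y = [math.log(ai) for ai in a]
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    return -slope, math.exp(my - slope * mx)


rng = random.Random(1)
r = [1.0 + 0.5 * i for i in range(12)]                  # separations from 1.0 to 6.5
a = [2.0 / ri * (1 + rng.gauss(0, 0.02)) for ri in r]   # synthetic 1/r law, strength 2
n, k = fit_power_law(r, a)
print(f"estimated exponent n = {n:.3f}, strength k = {k:.3f}")
```

An estimated exponent near 1 rather than 2 is evidence against standard inverse-square gravity. A "ripple" law would not follow a pure power law at all, so large structured residuals in this fit would be the signal to try a different model family.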

Implications and Limitations

MaD Physics exposes critical gaps in current LLM-based scientific agents:

Failure modes: Agents are highly sensitive to priors and prompt structure; they frequently default to standard physics even when empirical evidence contradicts it.
Symbolic discovery: The inability to induce correct symbolic laws, especially for systems with subtle alterations, points to a fundamental limitation in abstract mechanistic reasoning.
Resource-aware planning: Bayesian prompt strategies induce marginally better exploration, but even advanced LLMs are not near optimal in experimental design or uncertainty quantification.

Practically, this benchmark demonstrates that "frontier" LLMs, even with code execution and prompt scaffolding, are not yet capable of agentic scientific inference under realistic resource constraints. Theoretical implications concern the integration of active Bayesian design principles, multimodal perception, and robust symbolic regression, which are not adequately handled by current architectures.

Future research directions include:

Scaffolding improvements: Modular tool integration (e.g., structured reasoning chains, external solvers) and prompt optimizers such as AlphaEvolve and GEPA (2605.10820).
Environment expansion: Incorporation of more complex or diverse physical systems (biological, chemical), hierarchical measurement selection, and real-world noise.
Cross-model evaluation: Systematic benchmarking across multiple LLM families beyond Gemini, with comparison on both empirical and symbolic reasoning metrics.
Principled experimental design: Embedding explicit Bayesian optimal design frameworks leveraging LLMs (Choudhury et al., 28 Aug 2025).

Conclusion

MaD Physics provides evidence that current LLM-based scientific agents are deficient in empirical model discovery, active experimental design, and strategic reasoning under physical constraints. The benchmark offers a rigorous testbed for developing and evaluating future agentic systems in scientific domains, emphasizing the need for robust measurement planning, multimodal inference, and symbolic abstraction. It is a step toward aligning AI development with the demands and complexities of scientific inquiry.

Explain it Like I'm 14

What is this paper about?

This paper introduces MaD (Measuring and Discovering Physics), a set of computer-based challenges designed to test how well AI “scientist” agents can plan and make measurements when resources are limited, as they are in real science. Instead of just answering quiz questions or running unlimited experiments, the AI must decide what to measure, when to measure it, and how precisely to measure it, because more precise measurements cost more. Then, using the data it collected, the AI has to figure out how the system works and predict what will happen next.

What questions were the authors trying to answer?

The authors focused on three simple questions:

Can an AI choose smart measurements when it has a limited “budget” (like having only a few coins or tickets to spend)?
Can it use those measurements to discover the hidden rules of a physical system and make good predictions?
How well do current LLM systems (like different versions of Gemini) handle this kind of “measure-then-predict” science task, especially when the physics is a bit unusual?

How does MaD work?

Think of MaD like a science fair with a twist. The AI stands in front of a machine (a simulated physical system) and gets a handful of “measurement tickets.” Each ticket can be used to take a measurement. High-quality measurements (less noise, more precision) cost more tickets. The AI must:

Measurement phase (spend the tickets wisely)

Pick what to look at (for example, the position of a particle or the swirliness of a fluid).
Pick when to look (which time during the system’s evolution).
Pick how precise the measurement should be (more precise = more expensive).
Each measurement has some noise (like a slightly blurry photo). The AI must balance quality and cost.

Prediction phase (use what it learned)

After using up the budget, the AI is asked to predict what the system will look like at a later time.
The AI’s predictions are graded by how close they are to the true answers. Lower error = better.
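The quality-versus-quantity balance can be made concrete with a little arithmetic. Averaging n independent readings of the same quantity, each with noise sigma, leaves noise sigma / sqrt(n). Whether it is better to buy one precise reading or many cheap ones then depends on how steeply price rises with precision. The sketch below assumes a hypothetical price of sigma**(-exponent) per reading. The real cost curve is not given in this summary, and the function names are illustrative.

```python
import math


def effective_sigma(sigma, n):
    """Noise left after averaging n independent readings with noise sigma."""
    return sigma / math.sqrt(n)


def best_split(budget, sigmas, exponent):
    """Pick the noise level whose affordable repeats give the lowest averaged noise.

    Assumes a hypothetical price of sigma**(-exponent) per reading.
    Returns (sigma, number_of_readings, effective_sigma).
    """
    best = None
    for s in sigmas:
        n = int(budget // (s ** -exponent))
        if n == 0:
            continue  # cannot afford even one reading at this precision
        eff = effective_sigma(s, n)
        if best is None or eff < best[2]:
            best = (s, n, eff)
    return best


levels = [0.5, 0.25, 0.125]
print(best_split(64, levels, exponent=1))  # gentle price rise: precise readings win
print(best_split(64, levels, exponent=2))  # inverse-square price: every level ties
print(best_split(64, levels, exponent=3))  # steep price rise: many cheap readings win
```

The averaging rule only applies to repeated readings of the same unchanging quantity. In a moving system the AI also has to decide when to look, which this sketch leaves out.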

To prevent the AI from just recalling school physics, the authors sometimes change the rules of physics in small but important ways. This forces the AI to actually discover the rules from data.

The three kinds of physics environments

To keep things varied, MaD includes three different mini-worlds:

Classical mechanics (moving balls/particles)
- Normal version: particles move under forces like gravity.
- Altered versions:
  - Directional inertia: it’s harder to accelerate in directions the particle recently moved (like invisible “resistance memory” that depends on past motion).
  - Different gravity: gravity might fade with distance as 1/r (instead of the usual 1/r²), or wiggle with distance (gravity with tiny ripples).
Fluid mechanics (2D flowing fluid)
- Normal version: standard fluid flow with swirls and mixing.
- Altered versions:
  - “Alien” sideways push: an extra force always sideways to the flow, controlled by either how fast the fluid is moving or how swirly it is. This creates unusual patterns in the fluid.
Quantum mechanics (two quantum particles in a box)
- Normal version: particles follow the standard quantum rules; measuring them affects the system.
- Altered versions:
  - Unusual starting entanglement: the two particles start off with a special kind of link that depends on their distance.
  - Different measurement rule: the way measurements turn wave behavior into probabilities is changed (so what you’re likely to see can be different from standard quantum theory).
- Because quantum measurements disturb the system, the AI can repeat the same setup multiple times to learn the probabilities.

In all environments, the AI’s job is the same: spend a limited budget to collect the most useful information, then predict the future state.
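For the quantum mini-world, "repeat the same setup multiple times to learn the probabilities" amounts to counting how often each outcome shows up. The sketch below is a toy illustration with hypothetical names and a made-up two-outcome system. It also reports the usual standard error sqrt(f(1 - f)/shots), which shrinks only slowly as more repeats are spent.

```python
import math
import random


def estimate_probabilities(true_probs, shots, rng):
    """Simulate repeated measurements and return (frequencies, standard errors)."""
    outcomes = rng.choices(range(len(true_probs)), weights=true_probs, k=shots)
    counts = [0] * len(true_probs)
    for o in outcomes:
        counts[o] += 1
    freqs = [c / shots for c in counts]
    stderr = [math.sqrt(f * (1 - f) / shots) for f in freqs]
    return freqs, stderr


rng = random.Random(42)
hidden = [0.36, 0.64]  # probabilities the AI does not know in advance
for shots in (40, 4000):
    freqs, stderr = estimate_probabilities(hidden, shots, rng)
    print(shots, [round(f, 3) for f in freqs], [round(s, 3) for s in stderr])
```

Going from 40 to 4000 repeats costs 100 times as much but only cuts the uncertainty by about a factor of 10, which is why spending repeats wisely matters under a budget.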

How they tested the AIs

The authors used several Gemini models (2.5 Flash Lite, 2.5 Flash, 2.5 Pro, and 3 Flash). Each AI had a simple setup that allowed it to:

Read a short description of the task.
Write and run small pieces of Python code to help think and compute.
Follow either a basic instruction prompt (Base) or a more structured “think like a scientist” prompt (Strategy) that encourages planning and uncertainty-aware measurement.

The AI’s performance was measured by how close its predictions were to the truth (lower error is better).

What did they find?

Here are the main takeaways from the experiments:

Stronger models usually did better. In general, the more capable models (like Gemini 3 Flash) made better predictions than smaller ones (like Gemini 2.5 Flash Lite).
A structured “scientist” strategy helped. When the AI was prompted to plan measurements in a more systematic way, it often performed better than with a basic prompt.
Weird physics made the task harder. When the rules were altered (like unusual gravity or quantum measurement rules), prediction errors tended to increase.
Some models made impossible predictions. In the classical setting, weaker models sometimes produced “out-of-bounds” results (like particles appearing where they simply couldn’t be), showing poor control of their reasoning.
Vision made it tougher. When the AI had to read measurements from images (instead of being given clean numbers), errors went up—showing that turning pictures into precise data adds difficulty.
Learning across repeated trials helped—sometimes. In an “in-context learning” version (same setup repeated a few times), Gemini 3 Flash improved across episodes and ended up with lower errors. Other models improved less, especially when physics was altered.
When asked to estimate a hidden parameter, the AI often assumed “normal physics.” In a version where the AI had to estimate how strong the altered inertia was, it tended to underestimate that strength—showing a bias toward standard rules.
More complicated systems were harder. With more particles, prediction errors grew. Moving from 2D to 3D did not change performance as much.
Writing down the exact formula was very hard. Even when predictions were okay, AIs rarely produced the correct symbolic equation for the underlying law.

Why is this important?

Science in the real world is limited by time, money, and tools. Good scientists don’t just take lots of measurements—they choose the right measurements. MaD pushes AI systems to do exactly that: plan what to measure, balance precision against cost, make sense of noisy data, and still predict the future.

The potential impact includes:

Better AI lab assistants that save time and money by focusing on the most informative measurements first.
More realistic tests for AI scientific reasoning, beyond trivia or unlimited experiments.
Clear directions for improvement: better planning strategies, better prompts and tools, and models that can adapt when nature’s rules aren’t what they expect.
A flexible framework that can expand to other sciences (like chemistry or biology) where smart measurements are crucial.

In short, MaD is a step toward AI that can “think like a scientist” under real-world constraints—deciding not just how to answer questions, but which questions to ask and how best to ask them.


Knowledge Gaps

Below is a concise, actionable list of the main knowledge gaps, limitations, and open questions left unresolved by the paper. Future research can use these as concrete directions to extend and stress-test MaD and the evaluated agents.

Cost model is under-specified: the exact functional form of the measurement cost C(o, σ), how complexity of o_k translates to cost, and the specific budgets B used per task are not provided in the main text; no sensitivity analysis to cost curvature or budget scale is reported.
Budget and time horizon effects are not characterized: there is no systematic study of how performance scales with B, T_max, and the distribution of T_query (e.g., difficulty as a function of extrapolation gap beyond measurements).
Identifiability under constraints is unstudied: no theoretical or empirical analysis of when altered laws are distinguishable from standard physics given noisy, budgeted observations (e.g., minimum measurements or fidelities required for reliable discrimination).
No Bayes-optimal or principled planning baselines: classical active sensing / adaptive design algorithms (e.g., Bayesian optimization, GP-based experimental design, Kalman/particle filters, information-theoretic planners) are not included to contextualize LLM-agent performance.
Human expert baseline is absent: it remains unknown how trained scientists would perform under the same constraints and interfaces, and how far current models are from expert-level strategic measurement.
Limited statistical rigor: only three random seeds and N=5 prediction queries are used; no confidence intervals, variance decomposition, or statistical significance tests are reported.
Metric design is narrow: the benchmark uses point-error metrics (nRMSE, L2) only; uncertainty-aware scoring (e.g., proper scoring rules), calibration, and interval coverage are not evaluated, especially critical for probabilistic targets in the quantum setting.
Out-of-bounds predictions are not normalized or penalized consistently: allowing runaway predictions to dominate errors may confound comparisons across models; a principled treatment (e.g., bounded domains, robust metrics) is missing.
No decomposition of error sources: there is no analysis separating errors due to measurement selection, model inference, numerical solver inaccuracies, or perception (in the visual setting).
Measurement policy analysis is missing: the study does not report or release the agents’ chosen (t_k, o_k, σ_k) policies, information gain per unit cost, or ablations comparing simple heuristics (e.g., uniform-in-time sampling, always-high-fidelity) versus strategic policies.
Multi-fidelity trade-offs are not explored: although σ_k is part of the design, the paper does not analyze how agents use fidelity choices, how performance scales with noise, or how different cost-fidelity curves affect optimal planning.
Noise model is simplified: only Gaussian noise is used; robustness to non-Gaussian, biased, or heteroscedastic sensor noise (common in real instruments) is not evaluated.
Solver and numerical stability details are unclear: grid resolution, time-stepping, stability/Courant conditions, and their effects on prediction targets and difficulty are not documented in the main text; sensitivity to numerical choices is unassessed.
Quantum measurement modeling is under-specified: details of collapse under generalized Born rules (p ≠ 2) and how repeated trials interact with budget constraints are not analyzed; the trade-off between informative but destructive measurements and prediction accuracy remains open.
Scope of domains is narrow: MaD focuses solely on classical/fluid/quantum physics; benchmarks for other sciences (e.g., chemistry, biology, materials) with real-world measurement constraints are not included.
Passive-only setting may limit realism: the benchmark forbids interventions; many real scientific workflows mix observation and controlled perturbations—hybrid tasks are not studied.
Difficulty calibration is not provided: there is no standardized progression (easy→hard) with provable need for strategic planning, nor baseline heuristics to establish a “floor” and “ceiling” of achievable performance.
Generalization and transfer are underexplored: cross-environment transfer, adaptation to unseen alterations, and meta-learning across tasks are only lightly touched (8-episode ICL), with no systematic evaluation of transferability or retention.
Visual-observation variant lacks diagnostics: the visual pipeline (rendering resolution, viewpoint, occlusions) and the split between perception error and downstream inference are not quantified; it is unclear whether poor performance is due to perception or planning.
Fairness and comparability across domains: budgets and costs may not be normalized for task difficulty, making it hard to compare performance across the three environments or across altered vs normal physics.
Model coverage is narrow: only Gemini models are evaluated; results may not generalize to other LLMs/VLMs or to non-LLM agents; no head-to-head comparisons are provided.
Scaffold effects are conflated with model ability: only a minimal code-execution scaffold is tested; the contribution of scaffold design versus base-model capability remains unresolved without orthogonal scaffold/model ablations.
Reproducibility gaps: key hyperparameters (budgets, cost curves, solver settings, seeds) are not enumerated in the main text; without them, replicating exact scores and error modes is difficult.
Security and leakage controls are unspecified: with code execution tools, the risk that agents access unintended environment internals is not discussed; sandboxing and API constraints are unclear.
Contamination audit is missing: while “altered laws” aim to reduce memorization, there is no systematic audit (e.g., can models exploit known analogies or benchmark artifacts?) or hidden test sets to track overfitting post-release.
Scalability limits are not charted: experiments use small N, 2D fluids, and two-particle quantum systems; scaling to larger systems, higher-dimensional PDEs, or longer horizons is untested.
Symbolic discovery evaluation is ad hoc: the paper notes frequent failures but lacks a formal metric, equivalence checking, or a task split specifically designed to evaluate structure recovery under budgeted measurement.
Planning under query-time uncertainty is untested: agents are not evaluated on anticipating different T_query distributions (e.g., heavy-tailed or far extrapolation), a key aspect of strategic measurement.
No explicit POMDP framing or RL baselines: while related work is cited, MaD is not connected to a formal POMDP objective with reward per information gain or per-cost efficiency, nor are RL agents evaluated.
Domain-randomization and robustness: robustness to distribution shift in initial conditions, boundary conditions, or force-field parameters is not assessed; agents’ brittleness to small variations is unknown.
Multi-agent or tool-augmented strategies are untested: coordinated roles (e.g., planner/modeler/analyst), physics-tool integration, or automated experimental-design toolchains are suggested but not evaluated within MaD.


Practical Applications

Immediate Applications

Below are deployable applications that leverage MaD’s core ideas—cost–fidelity trade-offs, strategic measurement planning, and prediction from sparse/noisy observations—using existing tools and workflows.

AI scientific agent benchmarking and procurement
- Sector(s): Software, R&D, AI governance
- What: Use MaD as a standardized internal benchmark to compare LLM-based agents’ ability to plan measurements under budgets, infer models from data, and avoid “memorization” via altered-physics tasks.
- Tools/products/workflows: CI-style evaluation harness; scorecards for “measurement efficiency,” “prediction accuracy,” and “robustness to altered laws”; red-team suites that include visual observations and in-context learning (ICL) trials.
- Assumptions/dependencies: Access to MaD environments; reproducible seeds; ability to run agents with code-execution tools.
Prompt/scaffold optimization for scientific agents
- Sector(s): Software/AI tooling
- What: Apply Strategy-style prompts and prompt optimizers (e.g., AlphaEvolve, GEPA) to improve agents’ measurement planning and numerical inference on MaD tasks.
- Tools/products/workflows: Prompt optimization pipelines tied to MaD metrics; A/B testing of scaffolds (e.g., code-execution vs. tool-augmented).
- Assumptions/dependencies: Availability of model variants, prompt tooling, and automated evaluation loops.
Passive sensing planners for IoT/edge devices
- Sector(s): IoT, Smart buildings, Environmental monitoring
- What: Port MaD’s cost–fidelity measurement selection to select sampling rates, sensor modalities, and times under energy/bandwidth budgets.
- Tools/products/workflows: Edge libraries implementing budgeted active sensing; dashboards showing information gain per joule/byte.
- Assumptions/dependencies: Device APIs for adaptive sampling; calibrated noise and cost models.
Diagnostic test sequencing under cost/risk constraints
- Sector(s): Healthcare
- What: Decision support that plans which tests (and fidelity settings) to run to resolve diagnostic uncertainty efficiently (e.g., lab panels, imaging resolution).
- Tools/products/workflows: Clinical decision support modules that encode test costs, sensitivities/specificities as fidelity curves; EMR integration for patient-specific priors.
- Assumptions/dependencies: Validated cost–fidelity mappings for tests; regulatory review; clinician oversight.
Industrial inspection and metrology optimization
- Sector(s): Manufacturing, Robotics
- What: Schedule non-destructive testing (NDT) scans or robot inspection passes with variable fidelity (e.g., ultrasonic gain, scan density) to minimize time/cost while hitting detection thresholds.
- Tools/products/workflows: “Measurement Planner SDK” integrated into inspection robots; production dashboards showing defect-detection probability vs. budget.
- Assumptions/dependencies: Sensor noise/fidelity models; plant integration; safety constraints.
Remote sensing acquisition planning
- Sector(s): Agriculture, Forestry, Climate/Environment
- What: Choose when and what imagery to buy (spatial/spectral resolution, revisit frequency) to meet monitoring goals within budget.
- Tools/products/workflows: Satellite/drone acquisition planners that trade resolution vs. coverage; APIs to imagery marketplaces coupled with MaD-like planners.
- Assumptions/dependencies: Vendor APIs and pricing; weather/occlusion models; calibrated task utility functions.
Multimodal VLM evaluation with visual observations
- Sector(s): AI/Vision
- What: Test VLMs’ ability to extract quantitative measurements from rendered scenes and plan follow-up measurements.
- Tools/products/workflows: Visual MaD variants; image-to-number extraction performance tied to downstream prediction accuracy.
- Assumptions/dependencies: Reliable visual-to-numeric inference; rendering fidelity aligned with target domains.
In-context learning (ICL) performance audits
- Sector(s): AI research/quality assurance
- What: Use MaD’s repeated-trial protocol to quantify whether agents improve measurement strategies over episodes with identical initializations.
- Tools/products/workflows: ICL scorecards tracking convergence of prediction error; bias detection (e.g., bias toward standard physics).
- Assumptions/dependencies: Stable agent versions; reproducible episode resets; sufficient trial budgets.
Education: teaching experimental design under constraints
- Sector(s): Education (STEM, data science)
- What: Integrate MaD labs to teach students how to trade off measurement cost vs. fidelity and plan informative observations.
- Tools/products/workflows: Classroom modules; altered-physics “anti-memorization” exercises; rubrics for plan quality vs. outcomes.
- Assumptions/dependencies: Accessible UIs/simulators; instructor guides; compute availability.
AI safety and robustness probing
- Sector(s): AI safety, Compliance
- What: Use altered laws to diagnose memorization, brittle reliance on canonical physics, and failure modes in model-based reasoning.
- Tools/products/workflows: Robustness benchmarks; adversarial config sweeps (e.g., stronger alterations, more particles).
- Assumptions/dependencies: Diverse scenario libraries; organizational buy-in to adopt robustness criteria.

Long-Term Applications

The following applications require further research, integration with physical systems, scaling, or regulatory pathways before broad deployment.

Autonomous lab assistants for budgeted experimentation
- Sector(s): Materials science, Chemistry, Biology
- What: Closed-loop agents that plan multi-fidelity measurements (e.g., microscopy resolution, assay depth) to discover models/hypotheses within time/money constraints.
- Tools/products/workflows: Lab orchestration platforms linked to MaD-like planners; automated experimental design with safety gates.
- Assumptions/dependencies: Reliable lab robotics; validated cost–fidelity curves; oversight and auditability.
Clinical adaptive testing and resource allocation
- Sector(s): Healthcare
- What: Systems that personalize diagnostic pathways (test choice and fidelity/invasiveness) to maximize information per cost/risk for each patient.
- Tools/products/workflows: Hospital-grade decision support; calibration on prospective studies; explainability tools for clinicians and regulators.
- Assumptions/dependencies: Clinical trials; regulatory approval; integration with EMRs and payer policies.
Quantum experiment design and Hamiltonian learning
- Sector(s): Quantum computing/physics
- What: Measurement sequencing that accounts for state disturbance and non-standard measurement rules; efficient inference of system parameters with limited shots.
- Tools/products/workflows: Quantum-lab copilots integrating MaD’s quantum module; shot budget optimizers; collapse-aware planning.
- Assumptions/dependencies: Access to quantum hardware; realistic noise models; stability under experimental drift.
Large-scale sensor network scheduling for critical infrastructure
- Sector(s): Energy, Water, Transportation
- What: Adaptive sampling of grid/pipe/traffic sensors to maximize fault detection and forecasting accuracy under bandwidth/energy budgets.
- Tools/products/workflows: Network controllers with MaD-derived planners; anomaly-driven fidelity escalation; cyber-physical security interfaces.
- Assumptions/dependencies: Field validation; interoperability with OT systems; resilience and safety certifications.
CFD/meteorology data assimilation with budgeted observations
- Sector(s): Climate, Aerospace, Weather services
- What: Decide where and when to sample (e.g., UAV transects, buoy deployments) to reduce forecast uncertainty while minimizing cost.
- Tools/products/workflows: Mission planners that use MaD’s fluid mechanics analogs; twin-model assimilation loops.
- Assumptions/dependencies: High-fidelity forward models; logistics and operational constraints; evaluation on real campaigns.
Agricultural precision sampling
- Sector(s): AgriTech
- What: Optimize soil/leaf sampling, drone flights, and lab test fidelity to guide interventions (irrigation, fertilization) within budgets.
- Tools/products/workflows: Farm management software with measurement planners; field trial workflows.
- Assumptions/dependencies: Agronomic models; seasonal/weather variability; cost–benefit calibration.
Space and planetary exploration
- Sector(s): Robotics, Space
- What: Rover/drone measurement strategies that maximize scientific return under severe energy/time constraints without disturbing sensitive environments (passive sensing first).
- Tools/products/workflows: Onboard planners implementing MaD’s measurement–prediction loop; ground-in-the-loop oversight.
- Assumptions/dependencies: Radiation-hardened compute; autonomy validation; mission safety protocols.
Regulatory certification and standards for AI scientific agents
- Sector(s): Policy, Standards bodies
- What: MaD-inspired conformance tests that certify agents’ capacity to plan under constraints, avoid memorization, and produce reliable predictions.
- Tools/products/workflows: Public test suites; performance thresholds by sector (e.g., healthcare vs. manufacturing).
- Assumptions/dependencies: Multi-stakeholder adoption; governance frameworks; transparency requirements.
Finance and insurance risk inspections
- Sector(s): Finance, InsurTech, Real estate
- What: Budget-aware inspection planning (e.g., property/drone surveys at variable fidelity) to estimate risk with minimal cost.
- Tools/products/workflows: Underwriting copilots; inspection scheduling optimizers; post-inspection uncertainty accounting.
- Assumptions/dependencies: Regulatory compliance; calibrated risk models; data-sharing agreements.
Human-in-the-loop experiment planning UIs
- Sector(s): Software, R&D platforms
- What: Interfaces that expose cost–fidelity trade-offs and suggested measurement sequences, allowing researchers to adjust plans and constraints.
- Tools/products/workflows: Interactive planners with Bayesian experimental design primitives; explanation panels for expected information gain.
- Assumptions/dependencies: Usability testing; transparent uncertainty estimates; training for practitioners.
Multi-agent collaborative discovery workflows
- Sector(s): Research consortia, Pharma, Materials
- What: Teams of agents dividing measurement budgets across labs or instruments to collectively reduce uncertainty faster.
- Tools/products/workflows: Coordination protocols; shared priors/posteriors; budget-aware task assignment.
- Assumptions/dependencies: Data standardization; IP/privacy constraints; robust synchronization.
Pretraining and evaluation with “altered-law” curricula
- Sector(s): AI research
- What: Incorporate altered-physics tasks as anti-memorization curricula for pretraining and as routine stress tests during model validation.
- Tools/products/workflows: Curriculum generators varying alteration strength and system complexity; automated difficulty scaling.
- Assumptions/dependencies: Compute budgets; transferability to real tasks; coverage of relevant phenomena.
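
Many of the applications above share one core loop: choose which measurement to take, and at what fidelity, so that each unit of budget buys as much information as possible. The following is a minimal sketch of that idea, not the paper's method. It assumes a linear-Gaussian model (observation `y = x @ theta + noise`), for which the expected information gain has a closed form, and a greedy "gain per unit cost" rule. The names `greedy_plan`, `designs`, and `fidelities` are hypothetical and introduced only for illustration.

```python
import numpy as np

def expected_info_gain(cov, x, sigma):
    """Expected information gain (nats) of observing y = x @ theta + N(0, sigma^2)
    when theta has a Gaussian posterior with covariance `cov`."""
    return 0.5 * np.log1p(x @ cov @ x / sigma**2)

def update_cov(cov, x, sigma):
    """Rank-one Bayesian update of the posterior covariance after one observation."""
    cx = cov @ x
    return cov - np.outer(cx, cx) / (sigma**2 + x @ cx)

def greedy_plan(designs, fidelities, budget, dim):
    """Greedily pick (design, fidelity) pairs maximizing information gain per cost.

    designs:    array of candidate measurement vectors, shape (n, dim)  [assumed]
    fidelities: list of (noise_std, cost) pairs with cost > 0           [assumed]
    """
    cov = np.eye(dim)          # assumed unit Gaussian prior over theta
    remaining = budget
    plan = []
    while True:
        best = None
        for i, x in enumerate(designs):
            for sigma, cost in fidelities:
                if cost > remaining:
                    continue   # this option no longer fits in the budget
                score = expected_info_gain(cov, x, sigma) / cost
                if best is None or score > best[0]:
                    best = (score, i, sigma, cost)
        if best is None:
            break              # nothing affordable is left
        _, i, sigma, cost = best
        cov = update_cov(cov, designs[i], sigma)
        remaining -= cost
        plan.append((i, sigma, cost))
    return plan, cov

if __name__ == "__main__":
    plan, cov = greedy_plan(np.eye(2), [(1.0, 1.0), (0.1, 5.0)], budget=10.0, dim=2)
    print(plan)
    print(cov)
```

In the linear-Gaussian case the gain does not depend on the observed values, so the whole plan can be computed before any measurement is taken. In nonlinear settings such as the MaD environments, an agent would instead need to re-plan after each observation.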

Notes on feasibility and dependencies across applications:

Accurate cost–fidelity models are critical and domain-specific; real-world deployment requires careful calibration and validation.
Safety, reliability, and regulatory compliance are major gating factors in healthcare, infrastructure, and autonomous systems.
Transfer from simulated MaD environments to physical systems benefits from high-fidelity forward models and domain-adapted observation functions.
Tooling integration (code execution, simulators, lab/field APIs) and interpretability/explainability are necessary for adoption and oversight.


Glossary

1/R gravity: A modified gravitational law where force decays proportionally to 1/r rather than 1/r^2. "We refer to \Cref{eq:1r_grav} as 1/R"
Active sensing: Sequentially choosing measurements to maximize information about unknown parameters without intervening on the system. "Active sensing studies the problem of sequentially choosing informative measurements"
Adaptive experimental design: Planning and selecting experiments dynamically to infer model parameters or structure efficiently. "adaptive experimental design"
Anisotropic inertia: Direction-dependent resistance to acceleration, differing across spatial directions. "classical mechanics with anisotropic inertia"
Anisotropic inertial mass tensor: A matrix-valued mass that modifies acceleration differently along directions and can depend on motion history. "anisotropic inertial mass tensor $\mathbf{M}_i(t)$"
Bayesian experimental design: A framework that chooses experiments to maximize expected information gain under uncertainty. "inspired by Bayesian experimental design"
Convex combination: A weighted sum of components with nonnegative weights that sum to one. "a convex combination of the velocity and vorticity modulation variants."
Coriolis forces: Apparent forces arising in rotating frames; nonlinear variants can modify fluid dynamics. "nonlinear Coriolis forces"
Euclidean space: The standard flat geometric space of $D$ dimensions. "a $D$-dimensional Euclidean space"
Gaussian noise: Normally distributed random perturbations added to measurements or data. "corrupted by Gaussian noise"
Generalized Born rule: A nonstandard quantum measurement rule using $|\Psi|^p$ instead of $|\Psi|^2$ to define probabilities. "Generalized Born Rule."
Generalized probability measure: A probability assignment deviating from the standard $L_2$ norm, often using an $L_p$ norm. "a generalized probability measure based on the $L_p$-norm"
Generalized state variable: An abstract representation of a system’s time-evolving state. "a generalized state variable $s(t)$"
Gyroscopic forcing term: A force acting perpendicular to velocity, used here to inject or modulate vorticity. "“alien” gyroscopic forcing term"
Hamiltonian: The operator encoding the total energy that governs quantum time evolution. "The Hamiltonian $\hat{H}$ includes"
Incompressible viscous fluid: A fluid with zero divergence (constant density) and internal friction (viscosity). "an incompressible viscous fluid"
In-context learning: Improving task performance across trials by using prior interactions in the context rather than changing model parameters. "in-context learning."
Inertial memory: A history-dependent effect where past accelerations influence current effective inertia. "“inertial memory”"
Joint probability density: A probability distribution over multiple variables (e.g., positions of two particles). "the joint probability density is given by"
Kelvin-Helmholtz instability: Shear-driven instability at the interface of fluid layers that produces vortical structures. "Kelvin-Helmholtz instability"
Kinematic pressure: The pressure term in Navier–Stokes equations associated with enforcing incompressibility. "where $p$ is the kinematic pressure"
Kinematic viscosity: Momentum diffusivity of a fluid (viscosity divided by density). "and $\nu$ is the kinematic viscosity."
L2 error: An error metric based on the Euclidean norm of differences between predictions and truths. "We use the $L_2$ error"
Lp-norm: A family of vector norms parameterized by $p$, generalizing the Euclidean norm ($p=2$) and other norms. "the $L_p$-norm"
Marginal densities: Probability densities of a subset of variables obtained by integrating out others. "according to marginal densities"
Navier–Stokes equations: Fundamental partial differential equations governing viscous fluid flow. "Navier-Stokes equations"
Normalized root mean square error (nRMSE): RMSE scaled by a reference magnitude to enable comparable errors. "normalized root mean square error (nRMSE)"
Periodic domain: A spatial domain with wraparound boundaries so fields repeat across edges. "two-dimensional periodic domain"
POMDP: A Partially Observable Markov Decision Process, where the agent must act with incomplete state information. "as a POMDP"
Ripple (altered gravity): An inverse-square gravity modified by a sinusoidal ripple in distance. "We refer to \Cref{eq:ripple_grav} as Ripple."
Schrödinger equation: The fundamental time-dependent equation governing quantum wavefunction evolution. "Schrödinger equation"
Smoothed infinite square wells: Confining potentials approximating infinitely hard walls with smooth transitions. "smoothed infinite square wells"
Symbolic regression: Discovering analytic expressions that fit data, typically by searching over symbolic forms. "symbolic regression"
Variational quantum circuit: A parameterized quantum circuit optimized (often classically) for a target objective. "Variational Quantum Circuit Design"
Velocity Modulation: A forcing scheme where modulation depends on local speed or kinetic energy. "Velocity Modulation:"
Vorticity: A measure of local rotation in a fluid; in 2D it is a scalar curl of velocity. "vorticity"
Vorticity Modulation: A forcing scheme where modulation depends on the local vorticity field. "Vorticity Modulation:"
Wavefunction: A complex-valued function whose magnitude and phase encode quantum state amplitudes. "wavefunction"
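
Two of the glossary entries above reduce to short formulas, sketched here under stated assumptions. For the generalized Born rule, the sketch assumes a discretized wavefunction and normalizes $|\Psi|^p$ so the weights sum to one. For nRMSE, the glossary only says the RMSE is "scaled by a reference magnitude", so the normalizer used here (the root mean square of the ground truth) is an assumption and may differ from the paper's choice. The function names are hypothetical.

```python
import numpy as np

def generalized_born_probs(psi, p=2.0):
    """Outcome probabilities proportional to |psi|^p on a discrete grid.

    p = 2 recovers the standard Born rule; other p values give the altered rule.
    Normalizing by the sum is an assumption needed to get valid probabilities.
    """
    weights = np.abs(np.asarray(psi)) ** p
    return weights / weights.sum()

def nrmse(pred, truth):
    """RMSE divided by the RMS magnitude of the ground truth (assumed normalizer)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    rmse = np.sqrt(np.mean((pred - truth) ** 2))
    return rmse / np.sqrt(np.mean(truth ** 2))

if __name__ == "__main__":
    psi = np.array([1.0, 1.0j, 0.5]) / 1.5   # a normalized three-point wavefunction
    print(generalized_born_probs(psi, p=2.0))
    print(generalized_born_probs(psi, p=3.0))
    print(nrmse([1.1, 1.9, 3.2], [1.0, 2.0, 3.0]))
```

With this normalizer, predicting all zeros gives an nRMSE of exactly 1, which provides a simple reference point when comparing errors across environments.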

