
MaD Physics: A Measurement-First Approach

Updated 4 July 2026
  • MaD Physics is a benchmark that evaluates agents’ ability to plan informative measurements under strict resource and cost constraints.
  • It integrates active sensing, constrained experimental design, and adaptive inference to extract underlying physical laws from noisy, sparse observations.
  • The methodology applies to both controlled benchmarks and practical lab workflows, enabling interpretable model discovery through measurement-first strategies.

Measuring and Discovering Physics (MaD Physics) denotes a measurement-centered approach to scientific inference under explicit resource constraints. In its most explicit formulation, MaD Physics is a benchmark for evaluating whether an agent can make informative measurements and conclusions subject to constraints on the quality and quantity of measurements, with the goal of inferring an underlying physical law and predicting future states (Jain et al., 11 May 2026). A plausible broader reading, supported by related uses of the term in laboratory pedagogy, flavor physics, and hybrid machine-learning workflows, is a measurement-first methodology in which discovery depends jointly on sensing, model induction, and validation against physically meaningful targets (Gillen et al., 2022).

1. Definition and conceptual scope

MaD Physics was introduced to address a specific gap in the evaluation of scientific agents. The central claim is that scientific discovery is fundamentally a resource-constrained process: measurements have costs in time, money, or physical impact, and higher fidelity typically costs more. Existing benchmarks were described as focusing either on static knowledge-based reasoning or on unconstrained experimental design, and therefore as failing to capture the coupled problem of measuring, planning, and inferring under hard constraints (Jain et al., 11 May 2026).

The benchmark formalizes two fundamental capabilities. The first is planning informative measurements under a hard budget. The second is inferring an unknown dynamical model from noisy, sparse data. To mitigate contamination from existing knowledge, the benchmark includes altered physical laws rather than only canonical textbook dynamics. This design choice is not incidental: it shifts evaluation away from recall of standard formulas and toward adaptive experimentation and model construction (Jain et al., 11 May 2026).

A broader interpretation of MaD Physics is suggested elsewhere in the literature. In a smartphone magnetometry study, the phrase “MaD Physics connection” is used to describe a workflow in which a phone, paper, and textbook are sufficient to calibrate sensors, relate coordinate axes to Earth, collect data, perform non-linear curve fitting, extract physical constants, and validate fundamental laws in an everyday context (Gillen et al., 2022). In nuclear-structure work, the “MaD Physics workflow” denotes a two-stage process in which numerical regression produces smooth surrogates and symbolic regression then “white-boxes” those surrogates into interpretable expressions (Maheshwari et al., 7 Dec 2025). This suggests that MaD Physics is not confined to one benchmark, but characterizes a recurring structure: measurement selection, constrained inference, and interpretable law discovery.

2. Formal problem structure

In MaD Physics, each observation is described by a tuple $(t_k,o_k,\sigma_k)$, where $t_k$ is the time of the $k$th measurement, $o_k$ is an observation function, and $\sigma_k$ is a noise scale. The returned data obeys

$$y_k = o_k\bigl(s(t_k)\bigr)+\varepsilon_k,\qquad \varepsilon_k\sim\mathcal{N}(0,\sigma_k^2).$$

Each measurement incurs a cost $C(o_k,\sigma_k)$, strictly increasing in fidelity, and the full measurement sequence must satisfy the budget constraint

$$\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.$$

After the measurement phase, a random query time $T_{\rm query}>T_{\max}$ is drawn, and the agent must predict a target function of the future state (Jain et al., 11 May 2026).
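The measurement model and budget constraint can be illustrated with a small simulation. This is a minimal sketch rather than the benchmark's interface: the class `BudgetedOracle`, the toy oscillator state, and the cost model `cost(sigma) = c0 / sigma**2` are all assumptions introduced here for illustration. The only property taken from the text is that cost increases as the noise scale decreases.

```python
import math
import random

def cost(sigma, c0=1.0):
    """Hypothetical cost model (an assumption): lower noise costs more."""
    return c0 / sigma**2

class BudgetedOracle:
    """Returns y_k = o_k(s(t_k)) + eps_k while enforcing sum_k C(o_k, sigma_k) <= B."""

    def __init__(self, state_fn, budget, seed=0):
        self.state_fn = state_fn      # s(t): the hidden true trajectory
        self.remaining = budget       # B minus the cost spent so far
        self.rng = random.Random(seed)

    def measure(self, t, observe, sigma):
        c = cost(sigma)
        if c > self.remaining:
            raise RuntimeError("budget exhausted")
        self.remaining -= c
        # eps_k ~ N(0, sigma_k^2) is added to the noiseless observation.
        return observe(self.state_fn(t)) + self.rng.gauss(0.0, sigma)

# Toy hidden system (an assumption): unit-frequency oscillator, state = (position, velocity).
oracle = BudgetedOracle(lambda t: (math.cos(t), -math.sin(t)), budget=10.0)
position = lambda s: s[0]             # one possible observation function o_k

y_cheap = oracle.measure(0.5, position, sigma=1.0)   # costs 1.0
y_sharp = oracle.measure(1.0, position, sigma=0.5)   # costs 4.0
spent = 10.0 - oracle.remaining                      # total cost so far
```

Under this assumed cost model, a single further request at `sigma=0.1` would cost about 100 units and is rejected, which is the sense in which the budget is a hard constraint.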

The inference problem is therefore inseparable from experimental design. The benchmark gives a natural weighted least-squares estimator,

$$\hat\theta=\arg\min_\theta\;\sum_{k=1}^K\frac{\bigl[y_k - o_k\bigl(s_\theta(t_k)\bigr)\bigr]^2}{\sigma_k^2},$$

where $s_\theta(t)$ is the state trajectory under a candidate dynamical model with parameters $\theta$. The planning problem is cast as a POMDP-like trade-off in which each potential measurement has both cost and expected information gain. The benchmark writes this in terms of posterior-entropy reduction,

tkt_k2

This makes MaD Physics a joint test of active sensing and scientific induction rather than a pure regression task (Jain et al., 11 May 2026).
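The weighted least-squares estimator from this section can be sketched on a toy problem. The example assumes a one-parameter candidate model (exponential decay with rate `theta`), an identity observation function, and a brute-force grid search; none of these choices come from the benchmark, and the names `trajectory`, `wls_loss`, and `fit_theta` are hypothetical.

```python
import math
import random

def trajectory(theta, t):
    """Toy candidate model s_theta(t): solution of ds/dt = -theta * s with s(0) = 1."""
    return math.exp(-theta * t)

def wls_loss(theta, data):
    """Sum over k of [y_k - o_k(s_theta(t_k))]^2 / sigma_k^2, with o_k the identity."""
    return sum((y - trajectory(theta, t)) ** 2 / sigma**2 for t, y, sigma in data)

def fit_theta(data, lo=0.0, hi=3.0, steps=3001):
    """Brute-force arg min over a uniform grid; adequate for a single parameter."""
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return min(grid, key=lambda th: wls_loss(th, data))

rng = random.Random(1)
true_theta = 0.7
data = []
for k in range(20):
    t = 0.25 * (k + 1)
    sigma = 0.02 if k % 2 == 0 else 0.2   # alternate high- and low-fidelity readings
    data.append((t, trajectory(true_theta, t) + rng.gauss(0.0, sigma), sigma))

theta_hat = fit_theta(data)
```

Dividing each squared residual by `sigma**2` means the low-noise readings dominate the fit, so the estimate is driven mainly by the measurements that were most expensive to acquire.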

A common misconception is that the task is merely to fit parameters once data have been collected. The formal definition contradicts this. The benchmark score depends on which measurements were chosen, at what fidelity, and in what sequence, because the budget is exhausted before the final prediction phase begins. MaD Physics therefore evaluates model inference and constrained exploration simultaneously (Jain et al., 11 May 2026).
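To see why the choice of fidelity matters under a fixed budget, consider a scalar Gaussian parameter observed directly. The sketch below uses the standard Gaussian identities for posterior variance and entropy reduction, together with a hypothetical cost model `C(sigma) = 1 / sigma` and a myopic gain-per-cost rule. It is an illustration of the trade-off, not the benchmark's planner.

```python
import math

def entropy_gain(prior_var, sigma):
    """Entropy reduction (nats) for a scalar Gaussian parameter observed with noise sigma."""
    return 0.5 * math.log(1.0 + prior_var / sigma**2)

def posterior_var(prior_var, sigma):
    """Conjugate Gaussian variance update after one direct observation."""
    return 1.0 / (1.0 / prior_var + 1.0 / sigma**2)

def greedy_plan(prior_var, budget, sigmas, cost):
    """Repeatedly buy the affordable measurement with the best gain per unit cost."""
    plan, var = [], prior_var
    while True:
        affordable = [s for s in sigmas if cost(s) <= budget]
        if not affordable:
            return plan, var
        best = max(affordable, key=lambda s: entropy_gain(var, s) / cost(s))
        plan.append(best)
        budget -= cost(best)
        var = posterior_var(var, best)

# Hypothetical cost model (an assumption): halving the noise doubles the price.
plan, final_var = greedy_plan(prior_var=1.0, budget=10.0,
                              sigmas=[1.0, 0.5, 0.1], cost=lambda s: 1.0 / s)

# Alternative plan: spend the entire budget on one sigma = 0.1 measurement.
single_shot_var = posterior_var(1.0, 0.1)
```

With these assumed numbers the greedy rule buys five measurements at `sigma=0.5` and ends with posterior variance 1/21, whereas the single high-fidelity measurement reaches roughly 1/101. The myopic rule is therefore not optimal here, which illustrates why planning over the whole budget is part of the task.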

3. Benchmark environments and altered laws

The benchmark comprises three environments, each based on a distinct physical law and each admitting altered variants designed to frustrate memorization (Jain et al., 11 May 2026).

| Environment | Standard law and observations | Alterations and prediction target |
|---|---|---|
| Classical mechanics | tkt_k3 spherical bodies in tkt_k4 dimensions with pairwise gravitational forces | Anisotropic inertial mass tensor; “1/R” gravity; “Ripple” gravity; predict future positions |
| Fluid mechanics | 2D incompressible Navier–Stokes on a periodic box; observations are vorticity at chosen tkt_k5 | Alien gyroscopic forcing with velocity modulation, vorticity modulation, or convex combination; predict future vorticity |
| Quantum mechanics | Two particles in a 2D infinite well; observations collapse the wavefunction | Nonlinear entanglement initialization; generalized Born rule with tkt_k6; predict future probabilities in query regions |

In the classical environment, the standard law is modified in two ways. One is an anisotropic inertial mass tensor,

tkt_k7

The other is modified gravity, including a “1/R” law,

tkt_k8

and a “Ripple” law,

tkt_k9

The prediction metric is a normalized RMSE with kk0 defined as the box diagonal length (Jain et al., 11 May 2026).
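A normalized RMSE of this kind can be sketched as follows. The sketch assumes that the error is the root-mean-square Euclidean distance between predicted and true body positions and that the box is a hypercube of side `box_side`, so its diagonal is `box_side * sqrt(dim)`. The exact averaging used by the benchmark is not stated here, and the function name `nrmse` and its arguments are illustrative.

```python
import math

def nrmse(pred, true, box_side):
    """RMS position error across bodies, divided by the box diagonal length."""
    dim = len(true[0])
    diagonal = box_side * math.sqrt(dim)   # diagonal of a hypercubic box (assumption)
    squared = [sum((p - q) ** 2 for p, q in zip(pb, tb))
               for pb, tb in zip(pred, true)]
    return math.sqrt(sum(squared) / len(squared)) / diagonal

# Two bodies in a unit square; the second prediction is off by the full diagonal.
true_pos = [(0.0, 0.0), (1.0, 1.0)]
pred_pos = [(0.0, 0.0), (0.0, 0.0)]
score = nrmse(pred_pos, true_pos, box_side=1.0)
```

Under this normalization, a score near 1 means the typical position error is comparable to the size of the box itself, and out-of-bounds predictions can push the score above 1.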

In the fluid environment, the base dynamics are 2D incompressible Navier–Stokes with Kelvin–Helmholtz initial conditions. The altered term is an alien gyroscopic forcing

kk1

with either velocity modulation kk2, vorticity modulation kk3, or a convex combination. Performance is measured by an kk4 error on vorticity (Jain et al., 11 May 2026).

In the quantum environment, the standard law is a two-particle Schrödinger evolution in a 2D infinite well. Alterations include nonlinear entanglement initialization,

kk5

and a generalized Born rule,

kk6

Measurements themselves collapse the wavefunction, so the agent may repeat trials on identical initializations. The final target is the probability that a particle lies in a query region at kk7 (Jain et al., 11 May 2026).

4. Empirical findings on current agents

The initial benchmark study evaluated four Gemini models: Gemini 2.5 Flash Lite, Gemini 2.5 Flash, Gemini 2.5 Pro, and Gemini 3 Flash. The measured results show that frontier LLMs can perform nontrivial inference, but also expose clear weaknesses in structured exploration and data collection (Jain et al., 11 May 2026).

In classical mechanics with normal physics, Gemini 2.5 Flash Lite often produced runaway, out-of-bounds predictions. Gemini 2.5 Pro achieved approximately kk8 nRMSE in the base setting and kk9 with a Strategy prompt. Gemini 3 Flash achieved approximately oko_k0 in the base setting and oko_k1 with Strategy, while remaining reliably in-bounds. On altered classical laws, all models degraded, and the study reports no consistent trend with alteration strength (Jain et al., 11 May 2026).

In fluid mechanics, reported errors ranged from about oko_k2 for Gemini 2.5 Pro to about oko_k3 for Gemini 3 Flash. Strategy prompting typically reduced error by oko_k4–oko_k5. In quantum mechanics, reported errors were about oko_k6 for Gemini 2.5 Pro to about oko_k7 for Gemini 3 Flash on normal laws, with worse performance on non-standard oko_k8 norms and entanglement (Jain et al., 11 May 2026).

The most important negative result is symbolic rather than numeric. Even when structured prompting improved predictive accuracy, agents still failed to identify correct symbolic laws in all but the simplest altered tasks. The benchmark also documents parameter-fit bias: in classical anisotropic inertia, agents consistently underestimated the inertial-tensor coupling oko_k9. These findings make clear that successful short-horizon prediction does not imply faithful recovery of the underlying physical mechanism (Jain et al., 11 May 2026).

5. Measurement-centered laboratory practice

Outside the benchmark itself, MaD Physics is also used to describe laboratory workflows in which discovery is organized around low-cost sensing and explicit data analysis. A smartphone magnetometry experiment provides a particularly compact example. One activity uses the phone’s three-dimensional sensors to determine the vector components and orientation of the background magnetic field at the measurement location and compares the resulting field to NOAA’s Magnetic Field Calculator. A second activity measures the axial field of the small magnet in an earbud speaker and fits the data with

σk\sigma_k0

The reported results place measured σk\sigma_k1, σk\sigma_k2, and σk\sigma_k3 within a few percent of NOAA values, with ring-magnet fits yielding σk\sigma_k4–σk\sigma_k5 mm, σk\sigma_k6–σk\sigma_k7 mm, and reduced σk\sigma_k8 (Gillen et al., 2022).

Muon-detector projects extend the same logic to particle physics. The “Desktop Muon Detector” is described as a self-contained apparatus using a plastic scintillator and a silicon photomultiplier. The associated reports include a quoted total cost, a result of $22\%\pm7\%$, and measurements taken at $35{,}000$ ft over the geomagnetic equator (Axani et al., 2016, Axani, 2019).

These examples show a version of MaD Physics in which the constraint is not a simulated benchmark budget but the practical limitation of instrumentation, cost, and student-facing experimental infrastructure. The common structure remains the same: choose observables, calibrate measurement channels, fit physically motivated models, and use discrepancies or residuals to refine interpretation (Gillen et al., 2022).

6. Computational discovery workflows and interpretable laws

A distinct but related MaD Physics usage appears in hybrid machine-learning pipelines for scientific law discovery. In work on nuclear charge radii, the workflow begins with numerical regression using Light Gradient Boosting Machine and Gaussian Process Regression, both trained with four-fold cross-validation and automated hyperparameter optimization. On out-of-fold predictions over C(ok,σk)C(o_k,\sigma_k)1 nuclei with measured radii, the reported RMSE values are C(ok,σk)C(o_k,\sigma_k)2 fm for LGBM and C(ok,σk)C(o_k,\sigma_k)3 fm for GPR. Symbolic regression is then performed on C(ok,σk)C(o_k,\sigma_k)4 ML-predicted and extrapolated points using PySRRegressor, with a Pareto frontier used to balance MSE against expression complexity (Maheshwari et al., 7 Dec 2025).
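The Pareto-frontier selection step can be sketched in a few lines. The sketch assumes each candidate expression is summarized by a complexity score and an MSE, and it keeps only candidates that no other candidate beats on both. The candidate list is invented for illustration and does not correspond to any expression or value reported in the paper. The function `pareto_front` is a hypothetical helper, not part of PySR.

```python
def pareto_front(candidates):
    """Keep candidates not dominated in (complexity, mse); lower is better on both."""
    front = []
    best_mse = float("inf")
    # Scan in order of increasing complexity; keep a candidate only if it
    # improves on the best MSE achieved by any simpler candidate.
    for cand in sorted(candidates, key=lambda c: (c["complexity"], c["mse"])):
        if cand["mse"] < best_mse:
            front.append(cand)
            best_mse = cand["mse"]
    return front

# Made-up candidates (assumptions for illustration only).
candidates = [
    {"expr": "expr_a", "complexity": 1, "mse": 0.50},
    {"expr": "expr_b", "complexity": 3, "mse": 0.20},
    {"expr": "expr_c", "complexity": 3, "mse": 0.30},   # dominated by expr_b
    {"expr": "expr_d", "complexity": 5, "mse": 0.25},   # dominated by expr_b
    {"expr": "expr_e", "complexity": 7, "mse": 0.05},
]

front = pareto_front(candidates)
front_names = [c["expr"] for c in front]
```

Reading the frontier from left to right shows how much accuracy each additional unit of complexity buys, which is the basis for choosing an interpretable expression at a given complexity level.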

The resulting expressions are explicit. At complexity level C(ok,σk)C(o_k,\sigma_k)5, the GPR-distilled formula is

C(ok,σk)C(o_k,\sigma_k)6

with RMSE C(ok,σk)C(o_k,\sigma_k)7 fm, while the LGBM-distilled formula is

C(ok,σk)C(o_k,\sigma_k)8

with RMSE C(ok,σk)C(o_k,\sigma_k)9 fm. At complexity k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.0, both distilled families recover nearly the full numerical accuracy, at about k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.1 fm RMSE. The paper explicitly presents this as a MaD Physics cycle: numerical regression produces precise surrogate measurements, and symbolic regression then exposes interpretable physics expressions involving k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.2, k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.3, binding energy, isospin asymmetry, Casten factor, and nonlinear composite terms (Maheshwari et al., 7 Dec 2025).

Related research pursues comparable discovery goals with different measurement front ends. “Visual Physics” uses bounding-box trajectories extracted by Mask R-CNN, a k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.4-VAE latent model, and Eureqa-based symbolic regression to recover textbook laws such as free fall and uniform circular motion from video. In free fall, the study reports latent correlations with initial velocities of at least k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.5, recovery of k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.6 in synthetic data, and real-world recovery of k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.7 within k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.8 on hand-tossed basketball videos (Chari et al., 2019). “Physics-Informed Müntz-Szász Networks” make power-law exponents trainable and explicit; in a k=1KC(ok,σk)B.\sum_{k=1}^K C\bigl(o_k,\sigma_k\bigr)\le B.9-configuration wedge benchmark, constraint-aware training attains a Tquery>TmaxT_{\rm query}>T_{\max}0 success rate with Tquery>TmaxT_{\rm query}>T_{\max}1 mean error, whereas naive training succeeds in Tquery>TmaxT_{\rm query}>T_{\max}2 of cases with mean error Tquery>TmaxT_{\rm query}>T_{\max}3 (N'guessan et al., 30 Jan 2026).

A plausible implication is that MaD Physics, in its broader methodological sense, increasingly denotes a pipeline in which measurements are not only collected but structured to support interpretable inference. The front end may be a phone magnetometer, a scintillator counter, a rendered physical environment, or a learned surrogate model; the back end is a law-discovery step constrained by physical compatibility (Maheshwari et al., 7 Dec 2025).

7. Significance, misconceptions, and future directions

MaD Physics is significant because it joins three research problems that are often separated: active measurement, model discovery, and physically grounded evaluation. The benchmark formulation makes explicit that informative sensing under budget is part of scientific competence, not merely a prelude to it. This directly counters the misconception that discovery benchmarks can be reduced to symbolic regression on a fixed dataset (Jain et al., 11 May 2026).

The phrase also appears in high-energy theory-experiment synergy. In rare kaon decays, the “Synergy of theory & experiment (‘MaD Physics’)” is described as the combination of lattice-QCD reduction of uncertainties in Tquery>TmaxT_{\rm query}>T_{\max}4, Tquery>TmaxT_{\rm query}>T_{\max}5, and Tquery>TmaxT_{\rm query}>T_{\max}6 with NA62, KOTO, and successor measurements. Under this program, Tquery>TmaxT_{\rm query}>T_{\max}7 and Tquery>TmaxT_{\rm query}>T_{\max}8 are projected to probe new-physics scales up to about Tquery>TmaxT_{\rm query}>T_{\max}9 TeV when theory and experiment reach the few-percent level (Blum et al., 2022). This use of the term is not benchmark-oriented, but it reinforces the same underlying logic: measurement capability and theoretical identifiability advance together.

The benchmark authors outline several future directions. These include richer tool sets such as advanced code execution, plotting, and symbolic-algebra tools; prompt-engineering methods such as AlphaEvolve or GEPA; evaluation of non-Gemini frontier models; new environments involving more complex or higher-dimensional physical systems and other scientific domains; and formal Bayesian-design integration through acquisition functions such as mutual-information estimators. The benchmark can also be configured for multimodal reasoning by replacing numerical observations with rendered visual observations, although the initial experiments report substantially higher errors in that mode (Jain et al., 11 May 2026).

Taken together, the literature presents MaD Physics as both a specific benchmark and a broader research pattern. In the narrow sense, it is an extensible testbed for evaluating agents that must measure, plan, and infer under constraints. In the broader sense, it names a style of physics research in which discovery is driven by the disciplined coupling of measurements, cost-aware experimental choices, and interpretable models (Jain et al., 11 May 2026).
