MaD Physics: A Measurement-First Approach
- MaD Physics is a benchmark that evaluates agents’ ability to plan informative measurements under strict resource and cost constraints.
- It integrates active sensing, constrained experimental design, and adaptive inference to extract underlying physical laws from noisy, sparse observations.
- The methodology applies to both controlled benchmarks and practical lab workflows, enabling interpretable model discovery through measurement-first strategies.
Measuring and Discovering Physics (MaD Physics) denotes a measurement-centered approach to scientific inference under explicit resource constraints. In its most explicit formulation, MaD Physics is a benchmark for evaluating whether an agent can make informative measurements and draw conclusions subject to constraints on the quality and quantity of those measurements, with the goal of inferring an underlying physical law and predicting future states (Jain et al., 11 May 2026). A plausible broader reading, supported by related uses of the term in laboratory pedagogy, flavor physics, and hybrid machine-learning workflows, is a measurement-first methodology in which discovery depends jointly on sensing, model induction, and validation against physically meaningful targets (Gillen et al., 2022).
1. Definition and conceptual scope
MaD Physics was introduced to address a specific gap in the evaluation of scientific agents. The central claim is that scientific discovery is fundamentally a resource-constrained process: measurements have costs in time, money, or physical impact, and higher fidelity typically costs more. Existing benchmarks were described as focusing either on static knowledge-based reasoning or on unconstrained experimental design, and therefore as failing to capture the coupled problem of measuring, planning, and inferring under hard constraints (Jain et al., 11 May 2026).
The benchmark formalizes two fundamental capabilities. The first is planning informative measurements under a hard budget. The second is inferring an unknown dynamical model from noisy, sparse data. To mitigate contamination from existing knowledge, the benchmark includes altered physical laws rather than only canonical textbook dynamics. This design choice is not incidental: it shifts evaluation away from recall of standard formulas and toward adaptive experimentation and model construction (Jain et al., 11 May 2026).
A broader interpretation of MaD Physics is suggested elsewhere in the literature. In a smartphone magnetometry study, the phrase “MaD Physics connection” is used to describe a workflow in which a phone, paper, and textbook are sufficient to calibrate sensors, relate coordinate axes to Earth, collect data, perform non-linear curve fitting, extract physical constants, and validate fundamental laws in an everyday context (Gillen et al., 2022). In nuclear-structure work, the “MaD Physics workflow” denotes a two-stage process in which numerical regression produces smooth surrogates and symbolic regression then “white-boxes” those surrogates into interpretable expressions (Maheshwari et al., 7 Dec 2025). This suggests that MaD Physics is not confined to one benchmark, but characterizes a recurring structure: measurement selection, constrained inference, and interpretable law discovery.
2. Formal problem structure
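In MaD Physics, each observation is described by a tuple $(t_k, o_k, \sigma_k)$, where $t_k$ is the time of the $k$th measurement, $o_k$ is an observation function, and $\sigma_k$ is a noise scale. With $s(t)$ denoting the system state at time $t$, the returned data obey
$$y_k = o_k\bigl(s(t_k)\bigr)+\varepsilon_k,\qquad \varepsilon_k\sim\mathcal{N}(0,\sigma_k^2).$$
Each measurement incurs a cost $C(o_k,\sigma_k)$, strictly increasing in fidelity, and the full measurement sequence must satisfy a budget constraint: the summed cost $\sum_k C(o_k,\sigma_k)$ may not exceed the total budget.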
After the measurement phase, a random query time is drawn, and the agent must predict a target function of the future state (Jain et al., 11 May 2026).
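The measurement phase described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the hidden trajectory `true_state`, the cost schedule `cost`, the plan, and the budget value are all hypothetical. Only the Gaussian noise model and the hard budget come from the text.

```python
import random


def true_state(t):
    """Hypothetical hidden trajectory s(t); the agent never sees this directly."""
    return 2.0 * 0.5 ** t


def identity(state):
    """Hypothetical observation function o_k that reads the state as-is."""
    return state


def cost(observe, sigma):
    """Hypothetical cost schedule: lower noise (higher fidelity) costs more."""
    return 1.0 / sigma ** 2


def measure(t, observe, sigma, rng):
    """Return y_k = o_k(s(t_k)) + eps_k with eps_k ~ N(0, sigma_k^2)."""
    return observe(true_state(t)) + rng.gauss(0.0, sigma)


def run_plan(plan, budget, seed=0):
    """Execute measurements in order, stopping before the budget is exceeded."""
    rng = random.Random(seed)
    spent, data = 0.0, []
    for t, observe, sigma in plan:
        c = cost(observe, sigma)
        if spent + c > budget:
            break  # hard constraint: this measurement is unaffordable
        spent += c
        data.append((t, sigma, measure(t, observe, sigma, rng)))
    return data, spent


# One cheap noisy reading followed by two expensive precise ones.
plan = [(0.0, identity, 0.5), (1.0, identity, 0.1), (2.0, identity, 0.1)]
data, spent = run_plan(plan, budget=110.0)
```

With this plan the third measurement would push the total cost past the budget, so only the first two observations are collected.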
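The inference problem is therefore inseparable from experimental design. The benchmark gives a natural weighted least-squares estimator, in which each squared residual between a returned observation $y_k$ and the candidate model's prediction is weighted by the inverse noise variance $1/\sigma_k^2$. Writing $\theta$ for the model parameters,
$$\hat{\theta} = \arg\min_{\theta} \sum_k \frac{\bigl(y_k - o_k(s_\theta(t_k))\bigr)^2}{\sigma_k^2},$$
where $s_\theta(t)$ is the state trajectory under a candidate dynamical model $f_\theta$. The planning problem is cast as a POMDP-like trade-off in which each potential measurement has both cost and expected information gain. The benchmark writes this in terms of posterior-entropy reduction,
$$\mathrm{IG} = H\bigl[p(\theta\mid\mathcal{D})\bigr] - \mathbb{E}_{y}\Bigl[H\bigl[p(\theta\mid\mathcal{D}\cup\{y\})\bigr]\Bigr],$$
where $\mathcal{D}$ is the data gathered so far and $y$ is the outcome of the candidate measurement.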
This makes MaD Physics a joint test of active sensing and scientific induction rather than a pure regression task (Jain et al., 11 May 2026).
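A weighted least-squares fit of the kind described above can be illustrated with a one-parameter toy model. The sketch below assumes a hypothetical exponential-decay model `predict`, a coarse grid search in place of a real optimizer, and made-up data; it only shows how dividing each residual by its noise scale lets precise measurements dominate the fit.

```python
import math


def predict(theta, t):
    """Hypothetical candidate model: exponential decay with rate theta."""
    return math.exp(-theta * t)


def weighted_loss(theta, data):
    """Sum of squared residuals, each scaled by the inverse noise variance."""
    return sum(((y - predict(theta, t)) / sigma) ** 2 for t, sigma, y in data)


def fit(data, grid):
    """Grid-search estimate of theta (an assumption; any optimizer would do)."""
    return min(grid, key=lambda theta: weighted_loss(theta, data))


# Rows are (t_k, sigma_k, y_k). The first two readings are precise and follow
# theta = 0.7 exactly; the third is an outlier taken at low fidelity.
data = [
    (0.5, 0.01, math.exp(-0.7 * 0.5)),
    (1.0, 0.01, math.exp(-0.7 * 1.0)),
    (2.0, 1.00, 0.9),
]
grid = [i / 100 for i in range(1, 201)]
theta_hat = fit(data, grid)
```

Because the outlier carries a weight ten thousand times smaller than the precise readings, the estimate stays at the decay rate implied by the high-fidelity points.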
A common misconception is that the task is merely to fit parameters once data have been collected. The formal definition contradicts this. The benchmark score depends on which measurements were chosen, at what fidelity, and in what sequence, because the budget is exhausted before the final prediction phase begins. MaD Physics therefore evaluates model inference and constrained exploration simultaneously (Jain et al., 11 May 2026).
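To see why the choice of measurement matters and not only the subsequent fit, the expected reduction in posterior entropy can be estimated for competing measurement options. The following is a toy Monte Carlo sketch, not the benchmark's planner: the three candidate decay rates in `THETAS`, the uniform prior, and the model `predict` are hypothetical.

```python
import math
import random

THETAS = [0.2, 0.5, 0.8]        # hypothetical candidate decay rates
PRIOR = [1 / 3, 1 / 3, 1 / 3]   # uniform prior over the candidates


def predict(theta, t):
    """Hypothetical candidate model: exponential decay with rate theta."""
    return math.exp(-theta * t)


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def posterior(prior, t, sigma, y):
    """Bayes update under Gaussian noise with standard deviation sigma."""
    weights = [p * math.exp(-0.5 * ((y - predict(th, t)) / sigma) ** 2)
               for p, th in zip(prior, THETAS)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_gain(prior, t, sigma, rng, samples=2000):
    """Monte Carlo estimate of the expected posterior-entropy reduction."""
    h_before = entropy(prior)
    gain = 0.0
    for _ in range(samples):
        theta = rng.choices(THETAS, weights=prior)[0]
        y = predict(theta, t) + rng.gauss(0.0, sigma)
        gain += h_before - entropy(posterior(prior, t, sigma, y))
    return gain / samples


rng = random.Random(0)
gain_t0 = expected_gain(PRIOR, t=0.0, sigma=0.05, rng=rng)       # uninformative time
gain_precise = expected_gain(PRIOR, t=2.0, sigma=0.05, rng=rng)  # costly, low noise
gain_noisy = expected_gain(PRIOR, t=2.0, sigma=0.5, rng=rng)     # cheap, high noise
```

At `t = 0` every candidate predicts the same value, so even a precise reading yields no information. At `t = 2` the candidates separate, and the low-noise measurement is far more informative than the noisy one. An agent that spends its budget on the first option learns nothing, however well it fits afterwards.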
3. Benchmark environments and altered laws
The benchmark comprises three environments, each based on a distinct physical law and each admitting altered variants designed to frustrate memorization (Jain et al., 11 May 2026).
| Environment | Standard law and observations | Alterations and prediction target |
|---|---|---|
| Classical mechanics | Spherical bodies interacting through pairwise gravitational forces | Anisotropic inertial mass tensor; “1/R” gravity; “Ripple” gravity; predict future positions |
| Fluid mechanics | 2D incompressible Navier–Stokes on a periodic box; observations are vorticity measurements | Alien gyroscopic forcing with velocity modulation, vorticity modulation, or a convex combination; predict future vorticity |
| Quantum mechanics | Two particles in a 2D infinite well; observations collapse the wavefunction | Nonlinear entanglement initialization; generalized Born rule; predict future probabilities in query regions |
In the classical environment, the standard law is modified in two ways. One is an anisotropic inertial mass tensor. The other is modified gravity, including a “1/R” law and a “Ripple” law. The prediction metric is a normalized RMSE, with the box diagonal length as the normalizing scale (Jain et al., 11 May 2026).
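A normalized RMSE of this kind can be sketched as follows. The exact averaging convention used by the benchmark is not given in the text, so this version assumes the squared Euclidean position error is averaged over bodies before taking the root. The function name and the example box are hypothetical.

```python
import math


def nrmse(predicted, actual, box_lengths):
    """RMSE over bodies, divided by the diagonal length of the simulation box."""
    diagonal = math.sqrt(sum(length ** 2 for length in box_lengths))
    squared_errors = [
        sum((p - a) ** 2 for p, a in zip(pred_pos, true_pos))
        for pred_pos, true_pos in zip(predicted, actual)
    ]
    rmse = math.sqrt(sum(squared_errors) / len(squared_errors))
    return rmse / diagonal


# Two bodies in a hypothetical 3 x 4 box (diagonal 5). The first prediction is
# exact; the second is off by a full box diagonal.
score = nrmse(
    predicted=[(0.0, 0.0), (3.0, 4.0)],
    actual=[(0.0, 0.0), (0.0, 0.0)],
    box_lengths=(3.0, 4.0),
)
```

Dividing by the box diagonal makes the score dimensionless, so a value near 1 means the typical error is comparable to the size of the domain.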
In the fluid environment, the base dynamics are 2D incompressible Navier–Stokes with Kelvin–Helmholtz initial conditions. The altered term is an alien gyroscopic forcing with either velocity modulation, vorticity modulation, or a convex combination of the two. Performance is measured by an error norm on vorticity (Jain et al., 11 May 2026).
In the quantum environment, the standard law is a two-particle Schrödinger evolution in a 2D infinite well. Alterations include nonlinear entanglement initialization and a generalized Born rule. Measurements themselves collapse the wavefunction, so the agent may repeat trials on identical initializations. The final target is the probability that a particle lies in a query region at the query time (Jain et al., 11 May 2026).
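Because each measurement destroys the state it probes, a region probability has to be estimated from many single-shot trials on identically prepared systems. The sketch below illustrates that idea in a deliberately simplified setting that is an assumption, not the benchmark's environment: one particle, a 1D infinite well, the ground state, and the standard Born rule. All function names are hypothetical.

```python
import math
import random

WELL_WIDTH = 1.0


def density(x):
    """Standard Born-rule density for the 1D infinite-well ground state."""
    return (2.0 / WELL_WIDTH) * math.sin(math.pi * x / WELL_WIDTH) ** 2


def sample_position(rng):
    """One collapsing position measurement, drawn by rejection sampling."""
    peak = 2.0 / WELL_WIDTH  # maximum of the density
    while True:
        x = rng.uniform(0.0, WELL_WIDTH)
        if rng.uniform(0.0, peak) < density(x):
            return x


def estimate_region_probability(a, b, trials, rng):
    """Fraction of fresh trials whose measured position falls in [a, b]."""
    hits = sum(1 for _ in range(trials) if a <= sample_position(rng) <= b)
    return hits / trials


def exact_region_probability(a, b):
    """Closed-form integral of the density over [a, b], for comparison."""
    def cdf(x):
        return x / WELL_WIDTH - math.sin(2.0 * math.pi * x / WELL_WIDTH) / (2.0 * math.pi)
    return cdf(b) - cdf(a)


rng = random.Random(0)
estimate = estimate_region_probability(0.25, 0.75, trials=20000, rng=rng)
exact = exact_region_probability(0.25, 0.75)
```

The statistical error of the estimate shrinks only with the square root of the number of trials, which is why repeated measurements compete directly with other uses of a fixed budget.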
4. Empirical findings on current agents
The initial benchmark study evaluated four Gemini models: Gemini 2.5 Flash Lite, Gemini 2.5 Flash, Gemini 2.5 Pro, and Gemini 3 Flash. The measured results show that frontier LLMs can perform nontrivial inference, but also expose clear weaknesses in structured exploration and data collection (Jain et al., 11 May 2026).
In classical mechanics with normal physics, Gemini 2.5 Flash Lite often produced runaway, out-of-bounds predictions. Gemini 2.5 Pro and Gemini 3 Flash were each scored by nRMSE in the base setting and with a Strategy prompt, and Gemini 3 Flash remained reliably in-bounds. On altered classical laws, all models degraded, and the study reports no consistent trend with alteration strength (Jain et al., 11 May 2026).
In fluid mechanics, the reported errors span a range with Gemini 2.5 Pro at one end and Gemini 3 Flash at the other, and Strategy prompting typically reduced error. In quantum mechanics, errors on normal laws likewise ranged between those of Gemini 2.5 Pro and Gemini 3 Flash, with worse performance on non-standard norms and entanglement (Jain et al., 11 May 2026).
The most important negative result is symbolic rather than numeric. Even when structured prompting improved predictive accuracy, agents still failed to identify correct symbolic laws in all but the simplest altered tasks. The benchmark also documents parameter-fit bias: in classical anisotropic inertia, agents consistently underestimated the inertial-tensor coupling. These findings make clear that successful short-horizon prediction does not imply faithful recovery of the underlying physical mechanism (Jain et al., 11 May 2026).
5. Measurement-centered laboratory practice
Outside the benchmark itself, MaD Physics is also used to describe laboratory workflows in which discovery is organized around low-cost sensing and explicit data analysis. A smartphone magnetometry experiment provides a particularly compact example. One activity uses the phone’s three-dimensional sensors to determine the vector components and orientation of the background magnetic field at the measurement location and compares the resulting field to NOAA’s Magnetic Field Calculator. A second activity measures the axial field of the small magnet in an earbud speaker and fits the data with a non-linear ring-magnet field model. The reported results place the three measured field quantities within a few percent of NOAA values, and the ring-magnet fits return two millimeter-scale length parameters and a reduced chi-squared statistic (Gillen et al., 2022).
Muon-detector projects extend the same logic to particle physics. The “Desktop Muon Detector” is described as a self-contained apparatus using a plastic scintillator and a silicon photomultiplier, built at low total cost; the cited work also refers to measurements taken at altitude over the geomagnetic equator (Axani et al., 2016, Axani, 2019).
These examples show a version of MaD Physics in which the constraint is not a simulated benchmark budget but the practical limitation of instrumentation, cost, and student-facing experimental infrastructure. The common structure remains the same: choose observables, calibrate measurement channels, fit physically motivated models, and use discrepancies or residuals to refine interpretation (Gillen et al., 2022).
6. Computational discovery workflows and interpretable laws
A distinct but related MaD Physics usage appears in hybrid machine-learning pipelines for scientific law discovery. In work on nuclear charge radii, the workflow begins with numerical regression using Light Gradient Boosting Machine (LGBM) and Gaussian Process Regression (GPR), both trained with four-fold cross-validation and automated hyperparameter optimization. RMSE values in fm are reported for both LGBM and GPR on out-of-fold predictions over nuclei with measured radii. Symbolic regression is then performed on ML-predicted and extrapolated points using PySRRegressor, with a Pareto frontier used to balance MSE against expression complexity (Maheshwari et al., 7 Dec 2025).
The resulting expressions are explicit. At one reported complexity level, the GPR-distilled and LGBM-distilled formulas are each given with their own RMSE in fm; at another, both distilled families recover nearly the full numerical accuracy. The paper explicitly presents this as a MaD Physics cycle: numerical regression produces precise surrogate measurements, and symbolic regression then exposes interpretable physics expressions involving binding energy, isospin asymmetry, Casten factor, and nonlinear composite terms, among other inputs (Maheshwari et al., 7 Dec 2025).
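The accuracy-versus-complexity trade-off behind a Pareto frontier can be made concrete with a small selection routine. This sketch does not call PySR, and the candidate expressions, complexity scores, and MSE values are invented for illustration. It only shows which candidates survive when both complexity and error are to be minimized.

```python
def pareto_front(candidates):
    """Keep candidates that no simpler-or-equal candidate matches or beats on MSE."""
    front = []
    best_mse = float("inf")
    # Sort by complexity first so each candidate is compared with all simpler ones.
    for cand in sorted(candidates, key=lambda c: (c["complexity"], c["mse"])):
        if cand["mse"] < best_mse:
            front.append(cand)
            best_mse = cand["mse"]
    return front


# Hypothetical symbolic-regression outputs (not taken from the cited paper).
candidates = [
    {"expr": "c0", "complexity": 1, "mse": 0.90},
    {"expr": "c0 + c1*x", "complexity": 5, "mse": 0.40},
    {"expr": "c0*x + c1", "complexity": 5, "mse": 0.45},
    {"expr": "c0 + c1*x + c2*x*x", "complexity": 9, "mse": 0.41},
    {"expr": "c0 + c1*x + c2*sqrt(x)", "complexity": 11, "mse": 0.10},
]
front = pareto_front(candidates)
```

The complexity-9 expression is dropped because a simpler expression already achieves a lower error. Reading along the surviving entries shows how much accuracy each increment of complexity buys, which is the basis for choosing an interpretable formula.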
Related research pursues comparable discovery goals with different measurement front ends. “Visual Physics” uses bounding-box trajectories extracted by Mask R-CNN, a $\beta$-VAE latent model, and Eureqa-based symbolic regression to recover textbook laws such as free fall and uniform circular motion from video. In free fall, the study reports strong latent correlations with initial velocities, recovery of the target law in synthetic data, and real-world recovery within a stated tolerance on hand-tossed basketball videos (Chari et al., 2019). “Physics-Informed Müntz-Szász Networks” make power-law exponents trainable and explicit; in a multi-configuration wedge benchmark, the study contrasts the success rate and mean error of constraint-aware training with those of naive training (N'guessan et al., 30 Jan 2026).
A plausible implication is that MaD Physics, in its broader methodological sense, increasingly denotes a pipeline in which measurements are not only collected but structured to support interpretable inference. The front end may be a phone magnetometer, a scintillator counter, a rendered physical environment, or a learned surrogate model; the back end is a law-discovery step constrained by physical compatibility (Maheshwari et al., 7 Dec 2025).
7. Significance, misconceptions, and future directions
MaD Physics is significant because it joins three research problems that are often separated: active measurement, model discovery, and physically grounded evaluation. The benchmark formulation makes explicit that informative sensing under budget is part of scientific competence, not merely a prelude to it. This directly counters the misconception that discovery benchmarks can be reduced to symbolic regression on a fixed dataset (Jain et al., 11 May 2026).
The phrase also appears in high-energy theory-experiment synergy. In rare kaon decays, the “Synergy of theory & experiment (‘MaD Physics’)” is described as the combination of lattice-QCD reduction of uncertainties in three theoretical inputs with NA62, KOTO, and successor measurements. Under this program, the two rare decay modes in question are projected to probe new-physics scales in the TeV range when theory and experiment reach the few-percent level (Blum et al., 2022). This use of the term is not benchmark-oriented, but it reinforces the same underlying logic: measurement capability and theoretical identifiability advance together.
The benchmark authors outline several future directions. These include richer tool sets such as advanced code execution, plotting, and symbolic-algebra tools; prompt-engineering methods such as AlphaEvolve or GEPA; evaluation of non-Gemini frontier models; new environments involving more complex or higher-dimensional physical systems and other scientific domains; and formal Bayesian-design integration through acquisition functions such as mutual-information estimators. The benchmark can also be configured for multimodal reasoning by replacing numerical observations with rendered visual observations, although the initial experiments report substantially higher errors in that mode (Jain et al., 11 May 2026).
Taken together, the literature presents MaD Physics as both a specific benchmark and a broader research pattern. In the narrow sense, it is an extensible testbed for evaluating agents that must measure, plan, and infer under constraints. In the broader sense, it names a style of physics research in which discovery is driven by the disciplined coupling of measurements, cost-aware experimental choices, and interpretable models (Jain et al., 11 May 2026).