Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

Published 13 May 2026 in cs.CL | (2605.14040v1)

Abstract: We audit the multimodal-physics evaluation pipeline end-to-end and document three undetected construction practices that distort how the field measures vision-language reasoning: train-eval contamination, translation drift, and MCQ saturation. (1) Public training pools (UGPhysics-Train, SciInstruct, MMK12) pass single-stage 5-gram-Jaccard audits with zero hits across all six public physics evals; a three-stage audit (Jaccard -> mxbai-embed-large cosine -> Haiku-4.5 LLM-judge) surfaces 134 near-duplicates and 4,846 paraphrase candidates in SciInstruct alone. (2) A 17-pp Sonnet 4.5 delta on 59 paired Estonian-English olympiad problems (30.5% vs. 13.6%; sign test p=0.011, McNemar p=0.021, paired bootstrap 95% CI [+5.1, +28.9] pp). (3) A 46-pp format-and-novelty gradient on identical Sonnet weights between MCQ (79.7% on PhyX) and open-ended olympiad evaluation (33.4% on PhysOlym-A). We release four artifacts addressing these gaps: PhysCorp-A (6,432-record three-stage-audited multimodal corpus), PhysR1Corp (2,268-record closed-form RL pool), PhysOlym-A (500-problem, 99.8% novel-source held-out olympiad eval with native difficulty labels and an EN/ET bilingual subset), and Physics-R1, a reference GSPO+DAPO recipe cold-started from Qwen3-VL-8B-Thinking. Across 3 seeds, Physics-R1 lifts the audited corpus over the 8B base by +18.3 pp on PhysOlym-A liberal (8.0 -> 26.3 +/- 1.7; 7.1 pp behind Sonnet 4.5), +15.7 pp on PhysReason (23.9 -> 39.6 +/- 6.4; ahead of Qwen3-VL-32B and Gemini 2.5 Pro), +6.9 pp on OlympiadBench-Physics (46.2 +/- 1.5), and +4.1 pp on PhyX MCQ (77.8 +/- 0.3).


Authors (1)

Shan Yang

Summary

The paper introduces a three-stage contamination audit that detects near-duplicate paraphrases missed by standard n-gram methods.
It offers audit-clean datasets and a reinforcement learning recipe optimized with binary correctness rewards for robust physics reasoning.
Empirical results highlight significant drops in performance due to translation drift and format novelty, challenging conventional MCQ evaluations.

Summary of "Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning"

Motivation and Context

The paper examines how multimodal (vision-language) physics reasoning benchmarks are built and evaluated. It addresses three methodological failings in the field: inadequate contamination auditing between training and evaluation splits, translation-driven evaluation drift in multilingual benchmarks, and the saturation of MCQ-style evaluations, which masks real differences in open-ended capability. According to the paper, these gaps introduce systematic measurement error, miscalibrate leaderboard rankings, and obscure genuine ability gradients among open scientific LLMs and VLMs.

Methodological Innovations

Contamination Auditing Pipeline

A central technical contribution is a three-stage contamination audit pipeline. Conventional n-gram deduplication (5-gram Jaccard index, $J \geq 0.4$) is shown to have negligible detection power against paraphrase-class leakage: it returns zero hits on the public training pools audited. The pipeline adds an embedding-based recall pass (mxbai-embed-large, cosine $\geq 0.85$) and a high-precision LLM-judge filter (Anthropic Haiku-4.5) that separates true near-duplicates from same-topic neighbors. The combined pipeline surfaces 134 near-duplicates and 4,846 paraphrase-class candidates in SciInstruct alone, an overlap between training and evaluation sets that single-stage audits missed.
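The three stages can be sketched as a cascade in which each stage only sees pairs the previous one let through. This is a minimal illustration, not the released audit script: the thresholds (0.4 and 0.85) come from the text, while `audit_pair`, `toy_embed`, `toy_judge`, and the returned labels are hypothetical names. The embedder and judge are passed in as plain callables, so no real embedding or LLM API is assumed.

```python
import math
import re
import zlib
from typing import Callable, Sequence


def five_grams(text: str, n: int = 5) -> set:
    """Lowercased word n-grams of a problem statement."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; 0.0 if either is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def audit_pair(
    train_text: str,
    eval_text: str,
    embed: Callable[[str], Sequence[float]],
    judge: Callable[[str, str], bool],
    jaccard_min: float = 0.4,   # threshold quoted in the text
    cosine_min: float = 0.85,   # threshold quoted in the text
) -> str:
    """Return a hypothetical label for one (train, eval) pair."""
    # Stage 1: surface-form overlap on 5-grams.
    if jaccard(five_grams(train_text), five_grams(eval_text)) >= jaccard_min:
        return "ngram_hit"
    # Stage 2: embedding recall pass; most pairs stop here.
    if cosine(embed(train_text), embed(eval_text)) < cosine_min:
        return "clean"
    # Stage 3: the judge separates real overlap from same-topic neighbors.
    return "judge_flagged" if judge(train_text, eval_text) else "same_topic"


def toy_embed(text: str, dim: int = 64) -> list:
    """Stand-in embedder (hashed bag of words); NOT mxbai-embed-large."""
    vec = [0.0] * dim
    for tok in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def toy_judge(a: str, b: str) -> bool:
    """Stand-in for an LLM judge: flags pairs that share the same numbers."""
    nums = lambda t: set(re.findall(r"\d+(?:\.\d+)?", t))
    return nums(a) == nums(b)


if __name__ == "__main__":
    train = "A 2 kg block slides down a frictionless incline tilted at 30 degrees."
    test = "Determine the acceleration of a 2 kg body on a smooth ramp inclined at 30 degrees."
    print(audit_pair(train, test, embed=toy_embed, judge=toy_judge))
```

Ordering the stages from cheapest to most expensive means the judge, the costly step, only runs on pairs that already cleared the embedding threshold.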

Multilingual Evaluation Drift

The paper measures translation drift on a paired Estonian-English physics olympiad subset. On 59 paired items, Sonnet 4.5 shows a 17 pp absolute accuracy drop (30.5% on Estonian originals vs. 13.6% on English translations; sign test $p=0.011$, McNemar $p=0.021$). This indicates that cross-lingual benchmarks built only from translated problems can misestimate model performance, with the size and direction of the error depending on the model's cross-lingual competence and the fidelity of the scientific translation.
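Paired significance tests of this kind depend only on the discordant pairs, i.e. items solved in one language but not the other. The sketch below assumes exact binomial versions of both tests. The counts 18/59 and 8/59 are the integers consistent with the reported 30.5% and 13.6%. The discordant split `b = 13`, `c = 3` is an assumption chosen only to satisfy `b - c = 10`, since the summary does not give the actual split.

```python
from math import comb


def sign_test_one_sided(b: int, c: int) -> float:
    """P(X >= b) for X ~ Binomial(b + c, 0.5): one-sided exact sign test."""
    n = b + c
    return sum(comb(n, k) for k in range(b, n + 1)) / 2 ** n


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant counts b and c."""
    n = b + c
    tail = sum(comb(n, k) for k in range(0, min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


n_pairs = 59
correct_et = 18   # 18 / 59 = 30.5%
correct_en = 8    # 8 / 59 = 13.6%
delta_pp = 100 * (correct_et - correct_en) / n_pairs

# ASSUMPTION: b = solved only in Estonian, c = solved only in English.
# Only b - c = 10 is implied by the reported accuracies.
b, c = 13, 3

print(f"delta = {delta_pp:.1f} pp")
print(f"sign test (one-sided) p = {sign_test_one_sided(b, c):.3f}")
print(f"McNemar exact (two-sided) p = {mcnemar_exact(b, c):.3f}")
```

Under this assumed split the two functions return roughly 0.011 and 0.021, which happens to be close to the reported values. A different split with the same difference would give different p-values.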

Format and Novelty Gradient

Identical model weights (Sonnet 4.5) evaluated across PhyX (4-way MCQ), OlympiadBench-Physics (open-ended), and the new PHYSOLYM-A (contamination-audited, high-novelty, open-ended) show a 46 percentage point accuracy gap between the two extremes. The MCQ format saturates (79.7% on PhyX), while open-ended, high-novelty evaluation exposes capability deficits (33.4% on PHYSOLYM-A). This points to the major role of both problem format and genuine novelty (i.e., unseen sources, not paraphrastic or numerically perturbed variants) in measured physics reasoning, a gap the paper argues scale or routine RL fine-tuning does not close.

Released Artifacts and Protocols

Four key artifacts are released following this protocol:

PHYSCORP-A: A 6,432-problem, three-stage-audited multimodal training and evaluation corpus built from textbook, olympiad, StackExchange, and repackaged benchmark material. It includes the first machine-learning-ready release of Estonian Physics Olympiad and international olympiad problems. Full source path and license are preserved per record.
PHYSR1CORP: A 2,268-problem, strict-audit-verified RL training split suitable for GSPO-style fine-tuning.
PHYSOLYM-A: A 500-problem, 99.8% novel-source, difficulty-calibrated, bilingual (EN/ET), open-ended olympiad benchmark. Difficulty is annotated with native organizer-issued scales where available (e.g., Estonian 1–10).
Physics-R1 Recipe: An RL scheme (GSPO+DAPO backbone) that cold-starts from Qwen3-VL-8B-Thinking, trains on PHYSR1CORP, and uses MCQ-audited early stopping and binary correctness rewards (empirically validated as variance- and Goodhart-robust).

All artifacts ship under original or restrictive academic-use licenses as required, with per-record provenance. Audit, reward implementation, and LLM-judge scripts are provided.
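To make the binary correctness reward concrete, the following sketch scores a completion as 1.0 or 0.0 and then turns a group of such rewards into group-normalized advantages. It is not the released reward implementation. The `\boxed{...}` answer convention, the 1% relative tolerance, and all function names are assumptions. Dropping groups whose rewards are all identical follows the dynamic-sampling idea described for DAPO, and whether Physics-R1 configures it this way is not stated in the summary.

```python
import math
import re
import statistics
from typing import List, Optional

# ASSUMPTION: final answers are wrapped in \boxed{...} without nested braces.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the content of the last \\boxed{...}, or None if absent."""
    matches = BOXED.findall(completion)
    return matches[-1].strip() if matches else None


def binary_reward(completion: str, gold: str, rel_tol: float = 1e-2) -> float:
    """1.0 if the final answer matches gold, else 0.0; no partial credit."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # failing to commit to an answer earns nothing
    try:
        # Numeric answers: compare within a relative tolerance.
        ok = math.isclose(float(answer), float(gold), rel_tol=rel_tol)
    except ValueError:
        # Symbolic answers: whitespace-insensitive string match (crude).
        ok = answer.replace(" ", "") == gold.replace(" ", "")
    return 1.0 if ok else 0.0


def group_advantages(rewards: List[float]) -> Optional[List[float]]:
    """Group-normalized advantages, or None for a zero-signal group."""
    if len(set(rewards)) < 2:
        return None  # all correct or all wrong: nothing to learn from
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    samples = [r"... so a = \boxed{4.9}", r"\boxed{9.8}", "I am not sure."]
    rewards = [binary_reward(s, gold="4.9") for s in samples]
    print(rewards, group_advantages(rewards))
```

A reward with only two values leaves little surface form for the policy to exploit, which fits the summary's description of the binary configuration as Goodhart-robust.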

Evaluation and Empirical Findings

Three Principal Findings

Extensive Paraphrase Contamination in Public Pools: The three-stage audit detects up to 8.8% contamination (at $\cos \geq 0.85$) in carefully curated samples and up to 27.1% at lower thresholds. Public MCQ/physics-VL training pools that pass single-stage n-gram audits are in fact substantially leaky.
Translation Drift is Systemic: Translating problems introduces a substantial score delta, so translated benchmarks can underestimate or overestimate model ability depending on the language and the model's pretraining. This effect is quantifiable and must be accounted for in any cross-lingual benchmark.
Format and Novelty Bound the Signal: The 46-point format-novelty drop with fixed weights (MCQ $\rightarrow$ open-ended, saturated $\rightarrow$ novelty-audited) is not recoverable via scale or standard RL alone, establishing open-ended, contamination-audited benchmarks as necessary to reveal latent progress.

Physics-R1 Model Performance

Physics-R1, following the protocol above, is trained across three seeds. On PHYSOLYM-A, it improves on the 8B open-source base by +18.3 percentage points (8.0 $\rightarrow$ 26.3 $\pm$ 1.7, trailing Sonnet 4.5 by 7.1 points), with lifts on other open-ended benchmarks as well (e.g., +15.7 pp on PhysReason). The result is reported as stable across seeds, and the analysis attributes the improvement to reductions in three main error modes of the base model: failure to commit to final answers, over-reliance on unit/dimensional heuristics, and single-image panel fixation. The binary reward configuration consistently outperforms dense, physics-structured rewards on open-ended evaluation, which suggests that reward shaping along surface-form axes is ineffective and potentially risky in this domain.

Robustness and Reproducibility

The audit pipeline's findings are reported as robust to the choice of embedder and judge model: similarity scores agree with OpenAI's text-embedding-3-large at Spearman $\rho=0.78$, and cross-judge agreement with GPT-4o supports the claim that any judge leniency runs opposite to self-grading bias. The methodology, data, and scripts are released for independent re-audit or extension.
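An embedder-robustness check of this kind can be sketched as a rank correlation between the similarity scores two embedders assign to the same candidate pairs. The summary does not say exactly which quantities were correlated, so that pairing is an assumption, and the scores below are made-up illustrative values rather than data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with ties sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


if __name__ == "__main__":
    # HYPOTHETICAL cosine scores for the same five pairs under two embedders.
    embedder_a = [0.91, 0.86, 0.88, 0.95, 0.87]
    embedder_b = [0.83, 0.79, 0.85, 0.90, 0.78]
    print(round(spearman_rho(embedder_a, embedder_b), 3))
```

Rank correlation is a natural fit here because the two embedders need not share a score scale; only the ordering of candidate pairs has to agree.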

Implications and Future Directions

Theoretical and Practical Impact

This work exposes flaws in both published benchmark construction and claims of progress in physics-VLM capability. It provides a concrete, repeatable methodology (three-stage audit, native-language gold, and hard novel open-ended evaluation) along with artifacts intended to raise the standard for contamination robustness. The results are a direct warning that MCQ-centric evaluation and n-gram-only auditing are not reliable evaluation strategies on their own, and that leaderboard claims built on them are methodologically suspect. Via PHYSOLYM-A, the work also provides a non-saturating held-out evaluation protocol aimed at reducing systematic overreporting of progress.

Forward-Looking Considerations

Future work is explicitly outlined: MCQ-neutralization of open-ended benchmarks to separate format effects from novelty effects, embedding-model ablation, paraphrase- and translation-aware auditing, evaluation across more diverse VLM architectures, and SFT-scale ablations. The cross-lingual drift phenomenon is earmarked for targeted re-examination on models that are weak in low-resource languages, with the expectation that the sign of the effect may invert depending on pretraining.

Conclusion

"Physics-R1" systematically reconstructs multimodal physics reasoning evaluation around rigorous decontamination, format-aware measurement, and careful artifact release. The evidence demonstrates that superficial data cleaning, translation reliance, and MCQ benchmarking are insufficient for trustworthy measurement. The released datasets, audit tools, and RL recipe enable the community to pursue genuinely novel, robust, and scientifically meaningful advances in visual physics reasoning evaluation for LLMs and VLMs.
