- The paper introduces a three-stage contamination audit that detects near-duplicate paraphrases missed by standard n-gram methods.
- It offers audit-clean datasets and a reinforcement learning recipe optimized with binary correctness rewards for robust physics reasoning.
- Empirical results highlight significant drops in performance due to translation drift and format novelty, challenging conventional MCQ evaluations.
Summary of "Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning"
Motivation and Context
The paper examines how multimodal (vision-language) physics reasoning benchmarks are constructed and how evaluation is practiced on them. It addresses three methodological failings in the field: inadequate contamination auditing between training and evaluation splits, translation-driven evaluation drift in multilingual benchmarks, and the saturation of MCQ-style evaluations, which masks real differences in open-ended capability. These gaps systematically introduce measurement error, miscalibrate leaderboard rankings, and obscure genuine ability gradients among open scientific LLMs and VLMs.
Methodological Innovations
Contamination Auditing Pipeline
A central technical advance is a three-stage contamination audit pipeline. Conventional n-gram deduplication (5-gram Jaccard index, J ≥ 0.4) is shown to have negligible detection power against paraphrase-class leakage, missing all semantic contamination in public training pools. The pipeline adds an embedding-based recall pass (mxbai-embed-large, cosine ≥ 0.85) and a high-precision LLM-judge filter (Anthropic Haiku-4.5) that separates close duplicates from same-topic neighbors. The combined method surfaces 134 near-duplicates and 4,846 paraphrase-class candidates in SciInstruct alone, revealing substantial overlap between training and evaluation sets that prior audits missed.
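The three stages described above can be sketched as a single pairwise check. This is a minimal illustration, not the paper's released audit script: the function names, the whitespace tokenizer, the short-circuit order, and the `judge` callable (a stand-in for the LLM judge) are assumptions, and the embedding vectors are taken as precomputed inputs rather than produced by mxbai-embed-large.

```python
import math
from typing import Callable, Sequence


def ngrams(text: str, n: int = 5) -> set:
    """Word-level n-grams; whitespace tokenization is an assumption."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def audit_pair(
    train_text: str,
    eval_text: str,
    train_vec: Sequence[float],
    eval_vec: Sequence[float],
    judge: Callable[[str, str], bool],  # hypothetical stand-in for the LLM judge
    jaccard_min: float = 0.4,   # threshold quoted in the summary
    cosine_min: float = 0.85,   # threshold quoted in the summary
) -> str:
    # Stage 1: surface-form overlap via 5-gram Jaccard.
    if jaccard(ngrams(train_text), ngrams(eval_text)) >= jaccard_min:
        return "ngram_duplicate"
    # Stage 2: embedding recall pass; pairs below the threshold are dropped.
    if cosine(train_vec, eval_vec) < cosine_min:
        return "clean"
    # Stage 3: the judge separates true near-duplicates from topical neighbors.
    return "near_duplicate" if judge(train_text, eval_text) else "same_topic"
```

A paraphrased pair with no shared 5-grams passes stage 1 untouched, which is why the embedding and judge stages are needed to catch it.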
Multilingual Evaluation Drift
The paper dissects translation drift using a paired Estonian-English physics olympiad subset. On 59 paired items, Sonnet 4.5 shows an absolute accuracy drop of about 17 percentage points (30.5% on Estonian originals vs. 13.6% on English translations, p=0.011, sign test). This confirms that cross-lingual benchmarks built only from translated problems misestimate true model performance, with the size of the error depending on the model’s cross-lingual competence and the fidelity of the scientific translation.
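For readers unfamiliar with the paired sign test mentioned above, the following sketch shows how such a p-value is computed from discordant pairs (items solved in one language but not the other). The reported accuracies correspond to 18 and 8 correct answers out of 59, so the net difference is 10 items. The summary does not report how the discordant pairs split, nor whether the test was one- or two-sided, so the counts used in the example are hypothetical.

```python
from math import comb


def sign_test_p(wins: int, losses: int, two_sided: bool = True) -> float:
    """Exact binomial sign test on discordant pairs (ties are excluded).

    Under the null hypothesis each discordant pair is a fair coin flip.
    The one-sided value is the probability of seeing at most
    min(wins, losses) outcomes of the rarer kind.
    """
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail) if two_sided else tail


# Counts inferred from the reported percentages on 59 paired items.
correct_et, correct_en = 18, 8
assert round(100 * correct_et / 59, 1) == 30.5
assert round(100 * correct_en / 59, 1) == 13.6

# HYPOTHETICAL discordant split with the required net difference of 10:
# 13 items solved only in Estonian, 3 solved only in English.
only_et, only_en = 13, 3
p_one_sided = sign_test_p(only_et, only_en, two_sided=False)
p_two_sided = sign_test_p(only_et, only_en, two_sided=True)
```

With this hypothetical split the one-sided value is about 0.011 and the two-sided value about 0.021. Other splits with the same net difference give different values, so this illustrates the mechanics only.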
Identical model weights (Sonnet 4.5) evaluated on PhyX (4-way MCQ), OlympiadBench-Physics (open-ended), and the new PHYSOLYM-A (fully contamination-clean, high-novelty, open-ended) reveal a pronounced 46 percentage point drop in accuracy: the MCQ format saturates at 79.7%, while open-ended, high-novelty evaluation exposes capability deficits at 33.4%. This demonstrates that both problem format and genuine novelty (i.e., unseen sources, not paraphrastic or numerically perturbed variants) strongly shape measured physics reasoning, beyond what scale or routine RL fine-tuning can recover.
Released Artifacts and Protocols
Four key artifacts are released following this protocol:
- PHYSCORP-A: A 6,432-problem, three-stage-audited, multimodal training and evaluation corpus built from a union of textbook, olympiad, StackExchange, and repackaged benchmark material. It includes the first release in machine-learning format of Estonian Physics Olympiad and international olympiad problems. Full source path and license are preserved per record.
- PHYSR1CORP: A 2,268-problem, strict-audit-verified RL training split suitable for GSPO-style fine-tuning.
- PHYSOLYM-A: A 500-problem, 99.8% novel-source, difficulty-calibrated, bilingual (EN/ET), open-ended olympiad benchmark. Difficulty is annotated with native organizer-issued scales where available (e.g., Estonian 1–10).
- Physics-R1 Recipe: An RL scheme (GSPO+DAPO backbone) that cold-starts from Qwen3-VL-8B-Thinking, trains on PHYSR1CORP, and uses MCQ-audited early stopping and binary correctness rewards (empirically validated as variance- and Goodhart-robust).
All artifacts ship under original or restrictive academic-use licenses as required, with per-record provenance. Audit, reward implementation, and LLM-judge scripts are provided.
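To make the binary correctness reward in the recipe concrete, here is a minimal sketch. The answer-extraction convention (`\boxed{...}` with a numeric answer), the relative tolerance, and the group-relative advantage normalization are all assumptions for illustration. The summary states only that rewards are binary and that the backbone is GSPO+DAPO, so the released reward implementation may differ.

```python
import math
import re
from statistics import mean, pstdev
from typing import List, Optional


def extract_final_number(response: str) -> Optional[float]:
    """Return the last \\boxed{...} value as a float (hypothetical convention)."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return None
    try:
        return float(matches[-1])
    except ValueError:
        return None


def binary_reward(response: str, gold: float, rel_tol: float = 1e-2) -> float:
    """1.0 if the final answer matches gold within rel_tol, else 0.0.

    No partial credit is given for units, formulas, or reasoning steps.
    The tolerance value is an assumption.
    """
    pred = extract_final_number(response)
    if pred is None:
        return 0.0  # failing to commit to a final answer earns nothing
    return 1.0 if math.isclose(pred, gold, rel_tol=rel_tol) else 0.0


def group_advantages(rewards: List[float]) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt.

    Assumed form, common in group-based policy optimization. A group with
    identical rewards yields zero advantage for every sample.
    """
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Because the reward depends only on the final answer, a response that never states one scores zero. That is consistent with the summary's observation that the base model often fails to commit to final answers.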
Evaluation and Empirical Findings
Three Principal Findings
- Extensive Paraphrase Contamination in Public Pools: The three-stage audit detects up to 8.8% contamination (at cos≥0.85) in carefully curated samples and up to 27.1% at lower thresholds. Public MCQ/physics-VL training pools claimed to be contamination-free by single-stage n-gram methods are in fact substantially leaky.
- Translation Drift is Systemic: Translation of problems introduces a substantial score delta, systematically underestimating or overestimating model abilities depending on language and model pretraining. This effect is quantifiable and must be considered in any cross-lingual benchmark.
- Format and Novelty Bound the Signal: The 46-point format-novelty drop with fixed weights (MCQ → open-ended, saturated → novelty-audited) is not recoverable via scale or standard RL alone, establishing open-ended, contamination-audited benchmarks as necessary to reveal latent progress.
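The threshold dependence of the contamination rate in the first finding can be illustrated with a short sketch. It assumes each evaluation item has already been assigned its maximum cosine similarity to any training item. The function name and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def contamination_rate(max_similarities: Sequence[float], threshold: float) -> float:
    """Fraction of evaluation items whose nearest training neighbor
    reaches the cosine threshold."""
    if not max_similarities:
        return 0.0
    flagged = sum(1 for s in max_similarities if s >= threshold)
    return flagged / len(max_similarities)


# HYPOTHETICAL nearest-neighbor similarities for ten evaluation items.
toy_scores = [0.91, 0.62, 0.88, 0.40, 0.79, 0.30, 0.55, 0.86, 0.20, 0.70]

strict = contamination_rate(toy_scores, 0.85)  # threshold quoted in the summary
loose = contamination_rate(toy_scores, 0.75)   # a lower, illustrative threshold
```

Lowering the threshold can only flag more items, which matches the pattern of a higher reported rate at lower thresholds. The flagged set at a loose threshold is a candidate pool, though, and a judge stage is still needed to confirm which candidates are true duplicates.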
Physics-R1, following the protocol above, is trained across three seeds. On PHYSOLYM-A, it improves on the 8B open-source base by 18.3 percentage points (8.0 → 26.3, trailing the Sonnet 4.5 frontier by 7.1 points), with similar lifts on other open-ended benchmarks (e.g., +15.7 pp on PhysReason). The result is stable across seeds, and analysis attributes the improvement to the correction of three main error modes of the base model: failure to commit to final answers, over-reliance on unit/dimensional heuristics, and single-image panel fixation. The binary reward configuration consistently outperforms dense, physics-structured rewards on open-ended evaluation, demonstrating the inefficacy and even risk of reward shaping along surface-form axes in these domains.
Robustness and Reproducibility
The audit pipeline’s findings are robust to the choice of embedder and judge model (Spearman ρ=0.78 with OpenAI’s text-embedding-3-large; cross-judge agreement with GPT-4o supports the claim that judge leniency runs in the opposite direction to self-grading bias). The methodology, data, and scripts are released for independent re-audit or extension.
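The embedder robustness check relies on Spearman rank correlation between similarity scores from two embedding models. A dependency-free sketch of that statistic follows. The function names are illustrative, and the summary does not specify exactly which score lists the paper correlates.

```python
import math
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation of the average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    return pearson(average_ranks(x), average_ranks(y))
```

Calling `spearman(scores_a, scores_b)` on the per-pair similarities from two embedders gives a value near 1 when both models order the candidate pairs similarly, even if their raw cosine scales differ.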
Implications and Future Directions
Theoretical and Practical Impact
This work exposes critical flaws in both published benchmark construction and claims of progress in physics-VLM capability. It provides a concrete, repeatable methodology (three-stage audit, native-language gold, and hard, novel, open-ended evaluation) and high-quality artifacts that set a new standard for contamination robustness. The results serve as a direct warning that MCQ-centric or n-gram-only auditing regimes can no longer serve as reliable evaluation strategies, and that leaderboard-based claims built atop them are methodologically suspect. Through PHYSOLYM-A, the work also provides a non-saturating held-out evaluation protocol that removes systematic overreporting of progress.
Forward-Looking Considerations
Future work is explicitly outlined: MCQ-neutralization of open-ended benchmarks to orthogonalize format vs. novelty, embedding-model ablation, paraphrase- and translation-aware audit, further evaluation across more diverse VLM architectures, and SFT-scale ablations. The cross-lingual drift phenomenon is earmarked for targeted re-examination on models that are weak in low-resource languages, with the expectation that the sign of the drift may invert depending on pretraining.
Conclusion
"Physics-R1" systematically reconstructs multimodal physics reasoning evaluation around rigorous decontamination, format-aware measurement, and careful artifact release. The evidence demonstrates that superficial data cleaning, translation reliance, and MCQ benchmarking are insufficient for trustworthy measurement. The released datasets, audit tools, and RL recipe enable the community to pursue genuinely novel, robust, and scientifically meaningful advances in visual physics reasoning evaluation for LLMs and VLMs.