- The paper introduces a novel 800-problem, contamination-resistant benchmark spanning 7 scientific fields to advance evaluation of LLM scientific reasoning capabilities.
- It employs a hybrid human-AI pipeline that uses expert problem generation along with adversarial filtering to ensure original, complex, and reliable challenges.
- Evaluation results reveal significant gaps in current models' reasoning abilities, providing diagnostic insights and future directions for AI in scientific discovery.
"ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning" (2511.14366)
Introduction to Benchmark Saturation and Evaluation Needs
The rapid progress of LLMs has produced "benchmark saturation": traditional evaluation sets can no longer differentiate advanced models. State-of-the-art models now score near ceiling on benchmarks like MMLU, rendering them ineffective for assessing nuanced capabilities. This saturation highlights the need for more challenging benchmarks that can drive AI progress, particularly in scientific reasoning, an area identified as crucial for future breakthroughs.
At the same time, existing high-difficulty benchmarks suffer from narrow disciplinary focus and susceptibility to data contamination. ATLAS addresses both issues with a multidisciplinary evaluation suite designed specifically to test scientific reasoning, a step toward guiding AI development in science-focused applications.
ATLAS Design and Construction
ATLAS is designed to confront both benchmark saturation and data contamination. Its construction combines high difficulty with a rigorous contamination-resistant methodology: approximately 800 expertly crafted problems span seven core scientific disciplines, with difficulty calibrated so that current leading models achieve a pass rate below 20%. The problems retain real-world complexity and demand answers that reflect deep scientific reasoning.
Figure 1: Overview of ATLAS, which covers 7 STEM subjects and 57 corresponding sub-fields.
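The sub-20% pass-rate target lends itself to a simple curation gate. Below is a minimal sketch, assuming a hypothetical `model.solve` client API and a placeholder `answer_matches` grader; the paper's actual filtering relies on LLM judges and expert review rather than string matching:

```python
# Sketch of a sub-20% pass-rate gate for candidate problems.
# `frontier_models` is a list of hypothetical client objects; the
# `answer_matches` grader is a stand-in for ATLAS's LLM-based judging.

PASS_RATE_CEILING = 0.20  # the paper's stated difficulty target

def answer_matches(prediction: str, reference: str) -> bool:
    """Placeholder exact-match grader (stand-in for an LLM judge)."""
    return prediction.strip().lower() == reference.strip().lower()

def is_hard_enough(problem: dict, frontier_models: list, attempts: int = 4) -> bool:
    """Keep a candidate problem only if leading models rarely solve it."""
    solved, total = 0, 0
    for model in frontier_models:
        for _ in range(attempts):
            prediction = model.solve(problem["question"])  # hypothetical API
            solved += answer_matches(prediction, problem["reference_answer"])
            total += 1
    return solved / total < PASS_RATE_CEILING
```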
ATLAS employs a hybrid human-AI pipeline: experts author the problems, which are then adversarially filtered against state-of-the-art models. This process ensures originality and appropriate complexity while also enabling scalable evaluation through an automated judgment workflow. Responses are parsed and judged by capable LLMs that verify both natural-language and symbolic answer formats, which lets the benchmark retain problems with realistic, open-ended answers; a parsing sketch follows below.
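To make the parsing step concrete, here is a minimal extraction sketch. The `\boxed{...}` and `Final Answer:` conventions are common formatting assumptions, not necessarily the exact output format ATLAS prescribes:

```python
import re

def extract_final_answer(response: str) -> str | None:
    r"""Pull a final answer out of a formatted model response.

    Assumes the answer is wrapped in \boxed{...} or prefixed with
    'Final Answer:', both common conventions (not ATLAS's confirmed format).
    """
    boxed = re.search(r"\\boxed\{([^{}]*)\}", response)
    if boxed:
        return boxed.group(1).strip()
    tagged = re.search(r"Final Answer:\s*(.+)", response, re.IGNORECASE)
    return tagged.group(1).strip() if tagged else None

print(extract_final_answer(r"... hence the energy is \boxed{3.2 eV}"))  # "3.2 eV"
```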
Evaluation and Dataset Analysis
ATLAS provides a robust platform for evaluating LLM performance on scientific reasoning tasks. Models are prompted to solve problems and to format their outputs for rigorous machine-led judgment. The dataset spans a broad distribution of subjects and problem types, emphasizing complex reasoning and structured multi-part answers.
Figure 2: Overview of the evaluation workflow. The LLM is prompted to produce formatted predictions, from which answers are extracted and passed to the Judge LLMs to compute evaluation metrics.
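A compact sketch of the judge step in Figure 2, assuming a generic chat-completion client; `judge_llm.complete` is a hypothetical interface and the prompt wording is illustrative, not taken from the paper:

```python
JUDGE_PROMPT = """You are grading a scientific reasoning benchmark.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

def judge_accuracy(records: list[dict], judge_llm) -> float:
    """Score extracted answers with a judge LLM and return mean accuracy.

    `judge_llm.complete(prompt) -> str` is an assumed client interface;
    each record needs 'question', 'reference', and 'candidate' keys.
    """
    correct = 0
    for rec in records:
        verdict = judge_llm.complete(JUDGE_PROMPT.format(**rec))
        correct += verdict.strip().upper().startswith("CORRECT")
    return correct / len(records)
```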
The quantitative and qualitative analyses of ATLAS highlight its ability to distinguish high-quality reasoning among models. Across its comprehensive suite of tests, ATLAS provides insight into models' grasp of diverse scientific subjects, from molecular biology to materials science.
Benchmark Results and Observations
The evaluation results indicate that current models, while advanced, still show notable gaps in expert-level scientific reasoning. OpenAI GPT-5-High leads across subjects, yet every model exhibits significant variance in performance, pointing to areas that need improvement. The benchmark yields diagnostic insights into common failure modes, including loss of precision in numerical outputs and breakdowns in long, complex reasoning chains.
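To illustrate the numerical-precision failure mode: an exact string match would reject an answer that drifts in the last decimal place, whereas a tolerance-based check separates rounding drift from genuine errors. The 0.1% tolerance below is an assumption for illustration, not the paper's grading rule:

```python
import math

def numerically_close(pred: str, ref: str, rel_tol: float = 1e-3) -> bool:
    """Tolerance-based numeric grading; the tolerance value is illustrative."""
    try:
        return math.isclose(float(pred), float(ref), rel_tol=rel_tol)
    except ValueError:  # non-numeric answers need a different comparison
        return False

print(numerically_close("6.626e-34", "6.62607e-34"))  # True: rounding drift
print(numerically_close("6.0e-34", "6.626e-34"))      # False: genuine error
```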
Implications and Future Directions
The findings from ATLAS underscore the importance of benchmarks that prioritize real-world scientific reasoning, an essential domain in the pursuit of AGI. By embracing the interdisciplinary nature of science, ATLAS sets a quantitative standard for measuring scientific capability in AI models.
ATLAS aims to evolve as a long-term platform, actively expanding its scope and fostering community collaboration. Its ongoing development seeks to incorporate more languages and scientific fields, aiming to establish itself as a sustainable resource for evaluating and advancing AI's role in scientific discovery.
In conclusion, ATLAS represents a meaningful shift towards benchmarks that more accurately reflect the complexities of scientific reasoning, offering both immediate insights into current model capabilities and strategic directions for future research in AI for Science.