Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs

Published 9 May 2026 in cs.CL | (2605.09063v1)

Abstract: Following the recent achievement of gold-medal performance on the IMO by frontier LLMs, the community is searching for the next meaningful and challenging target for measuring LLM reasoning. Whereas olympiad-style problems measure step-by-step reasoning alone, research-level problems use such reasoning to advance the frontier of mathematical knowledge itself, emerging as a compelling alternative. Yet research-level math benchmarks remain scarce because such problems are difficult to source (e.g., Riemann Bench and FrontierMath-Tier 4 contain 25 and 50 problems, respectively). To support reliable evaluation of next-generation frontier models, we introduce Soohak, a 439-problem benchmark newly authored from scratch by 64 mathematicians. Soohak comprises two subsets. On the Challenge subset, frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively, leaving substantial headroom, while leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%. Notably, beyond standard problem solving, Soohak introduces a refusal subset that probes a capability intrinsic to research mathematics: recognizing ill-posed problems and pausing rather than producing confident but unjustified answers. On this subset, no model exceeds 50%, identifying refusal as a new optimization target that current models do not directly address. To prevent contamination, the dataset will be publicly released in late 2026, with model evaluations available upon request in the interim.

Summary

  • The paper introduces a contamination-resistant, expert-curated benchmark targeting research-level mathematical reasoning in LLMs.
  • It details a rigorously constructed dataset with three subsets (Challenge, Refusal, and SOOHAK-Mini) on which eleven open- and closed-weight LLMs are evaluated.
  • The evaluation exposes limitations in handling ill-posed queries and highlights the need for improved hallucination suppression in LLMs.

SOOHAK: Benchmarking Research-Level Mathematical Reasoning in LLMs

Motivation and Context

The proliferation of LLMs capable of olympiad-level mathematical reasoning calls for next-generation benchmarks that rigorously probe graduate and research-adjacent mathematical capability. Existing resources, such as MATH, GSM8K, and more recent benchmarks like FrontierMath and Riemann-Bench, have either been saturated by top LLMs or are too narrow in scope to differentiate state-of-the-art models. Contamination further erodes the reliability of many benchmarks: when test items overlap with public training data, scores overestimate generalization capacity. SOOHAK addresses these limitations with a contamination-resistant, expert-authored, bilingual, and multi-faceted benchmark for next-generation LLM evaluation (2605.09063).

Dataset Design and Construction

SOOHAK is constructed with methodological rigor, drawing on a large and diverse contributor pool (105 mathematicians, 64 of whom authored the main set, including 48% faculty and 25% undergraduates) and nearly $550,000 in funding. The benchmark comprises three primary subsets:

  • Challenge: 340 newly authored, graduate-level and research-adjacent problems, selected through an explicit model-gated pipeline so that they resist solution by current LLMs. Contributors were required to submit original problems, use no AI, and sign strict NDA and IP-transfer agreements.
  • Refusal: 99 items derived from intentionally ill-posed or flawed problems, targeting the evaluation of a model's ability to recognize and appropriately abstain from answering mathematically unsound queries—an essential but underexplored trait in research mathematics.
  • SOOHAK-Mini: 702 questions spanning olympiad to early graduate content, constructed for coverage across varying LLM capabilities, with less stringent gating and broader contributor base.

Items are distributed across core mathematics domains: algebra, number theory, combinatorics, analysis, geometry/topology, applied mathematics, probability, and logical foundations, with explicit coverage tracked by both contributor-supplied keywords and LLM-classified Mathematics Subject Classification (MSC) tags.

The data collection pipeline includes AI gating at three scales (using open models from 7B to 235B parameters), followed by dual human expert review. A robust mechanism for leakage mitigation, question originality verification, and rigorous bilingual translation is employed, with an embargo on public release until late 2026 to preclude training data contamination.
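The gating step described above can be sketched as a simple routing rule. This is a minimal illustration, not the authors' pipeline: the `GateModel` type, the `route_problem` function, the toy gate models, and the use of exact string matching are all assumptions made for the example. Per the summary, a problem goes to Challenge only if every gate model fails it, and human expert review follows.

```python
from typing import Callable, Sequence

# Hypothetical type: a gate model maps a problem statement to a final answer.
GateModel = Callable[[str], str]


def route_problem(problem: str, gold: str, gate_models: Sequence[GateModel]) -> str:
    """Return "challenge" if every gate model fails the problem, else "mini".

    Exact string comparison is an assumption made for brevity; a real
    pipeline would need a proper equivalence check and human review.
    """
    for model in gate_models:
        if model(problem).strip() == gold.strip():
            return "mini"  # at least one gate model solved it
    return "challenge"  # all gate models failed


# Toy stand-ins for a small and a large gate model (illustrative only).
def small_model(problem: str) -> str:
    return "0"


def large_model(problem: str) -> str:
    return "42" if "easy" in problem else "0"


gates = [small_model, large_model]
print(route_problem("an easy sum", "42", gates))
print(route_problem("a hard question", "42", gates))
```

In this toy run the first problem is solved by the larger gate model and is routed to the companion set, while the second defeats both gates and is routed to Challenge.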

Evaluation Protocol

Eleven LLMs—encompassing closed-weight (e.g., Gemini-3-Pro, GPT-5, Claude-Opus-4.5) and open-weight (e.g., Qwen3-235B, GPT-OSS-120B, Kimi-2.5, GLM-5) architectures—are evaluated on the SOOHAK splits. The evaluation protocol comprises:

  • Sampling: Three independent responses per model-question pair, reported as Avg@3 and Pass@3.
  • Judging: An LLM-based judge (GPT-5-Mini) matches model output to gold answers for equivalence, blinded to question text and solution.
  • Human Baselines: Five teams with IMO medalists, math undergraduates, and research PhDs solve a 79-question subset under strict time and tool-use constraints, permitting non-AI computational aids.
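The two sampling metrics can be computed as follows. This is a minimal sketch assuming each question has a list of per-sample correctness flags already produced by the judge; the function names and the toy data are illustrative and do not come from the paper.

```python
from typing import Dict, List


def avg_at_k(results: Dict[str, List[bool]]) -> float:
    """Mean, over questions, of the fraction of samples judged correct."""
    per_question = [sum(flags) / len(flags) for flags in results.values()]
    return sum(per_question) / len(per_question)


def pass_at_k(results: Dict[str, List[bool]]) -> float:
    """Fraction of questions with at least one sample judged correct."""
    solved = [any(flags) for flags in results.values()]
    return sum(solved) / len(solved)


# Toy data: three questions, three samples each (not real benchmark results).
toy = {
    "q1": [True, False, False],
    "q2": [False, False, False],
    "q3": [True, True, True],
}
print(round(avg_at_k(toy), 4))   # 0.4444
print(round(pass_at_k(toy), 4))  # 0.6667
```

Avg@3 rewards consistency across samples, while Pass@3 only asks whether the model can reach the answer at all, so Pass@3 is never lower than Avg@3 on the same results.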

Main Results

Quantitative Performance

On the SOOHAK Challenge split, top models substantially underperform compared to their olympiad-level baselines:

  • Gemini-3-Pro: Avg@3 = 30.39%
  • GPT-5: Avg@3 = 26.37%
  • Claude-Opus-4.5: Avg@3 = 10.39%
  • Best Open Model (Kimi-2.5): Avg@3 = 13.87%

On SOOHAK Refusal, no model exceeds 50% correctness ("carefulness"), indicating a prominent failure mode: hallucination and overconfident responses to ill-posed or ambiguous prompts. GLM-5 leads with 49.49% Avg@3, outperforming closed models on this axis.

On SOOHAK-Mini, scores are considerably higher (e.g., GPT-5 achieves 72.22%), indicating that the benchmark's difficulty is concentrated in the Challenge split rather than in the mixed-difficulty companion set.

Human participants collectively achieve 50.6% coverage on the 79-question human-baseline subset. Only Gemini-3-Pro (60.8%) surpasses the combined human performance. Notably, the highest-performing individual human teams are composed of undergraduate math majors with olympiad backgrounds, not research PhDs, indicating that the format currently favors contest-style training over deep research specialization under time constraints.
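One way to read "combined coverage" is as the fraction of questions solved by at least one team. The sketch below makes that assumption explicit; the function name, team labels, and question IDs are made up for illustration, and the paper may define coverage differently.

```python
from typing import Dict, Set


def combined_coverage(solved_by_team: Dict[str, Set[int]], n_questions: int) -> float:
    """Fraction of questions solved by at least one team (union of solved sets)."""
    solved = set().union(*solved_by_team.values())
    return len(solved) / n_questions


# Toy data: three teams on an 8-question subset (not real results).
teams = {"A": {1, 2, 3}, "B": {3, 4}, "C": {2}}
print(combined_coverage(teams, 8))                    # 0.5
print(max(len(s) for s in teams.values()) / 8)        # best single team: 0.375
```

Under this reading, combined coverage can exceed every individual team's score, which is why it is a demanding reference point for a single model.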

Diagnostic Insights

  • Open-weight models are competitive on SOOHAK-Mini but display a pronounced deficit on Challenge items, underscoring data exposure and architectural limitations in transferring to novel, research-level mathematics.
  • Compute scaling (model size/test-time inference length) yields steady improvements on Challenge, but refusal capability does not scale analogously, highlighting that hallucination suppression requires distinct interventions from reasoning enhancement.
  • MSC-subfield analysis reveals substantial across-model variance, with Gemini-3-Pro dominating number theory and analysis, Grok-4.1-Fast performing best in geometry and stochastic fields, and only one open-weight system (GPT-OSS-120B, in linear algebra) leading a subfield.

Implications

Practical Considerations

SOOHAK sets a new standard in mathematical benchmark construction, integrating robust anti-contamination policies, rigorous contributor verification, and challenging domains. Its embargo strategy and bilingual format mitigate training data leakage without sacrificing eventual transparency and reproducibility. Evaluation on both soundness and carefulness (refusal) challenges the domain-centric, step-reasoning benchmarks that dominate current practice, foregrounding the importance of recognizing ill-posedness and the limits of model expressivity.

Theoretical Relevance

By decoupling performance on "reasoning" (solving well-posed, research-adjacent mathematics) from "carefulness" (abstaining on ill-posed queries), SOOHAK sharpens diagnostic precision regarding LLM failure modes, particularly hallucination and overconfidence. The observed absence of scaling in refusal metrics suggests that naive increases in parameter count or test-time compute are insufficient for addressing epistemic uncertainty and eliminating hallucination; this problem likely requires architectural or training regime modifications.

Limitations

The "unique integer answer" format, while promoting automatable grading and low ambiguity, restricts the universe of tractable items and penalizes proofs, constructions, or answers in equivalence classes inherent to higher mathematics. Difficulty-driven incentives can misalign with quality, and the compressed development timeframe limited reviewer bandwidth and global recruiting, resulting in some subfield imbalance.

Future Directions

  • Proof-centric and construction-based evaluation: Integrating proof assistant verification or structured answer pipelines for domains where unique numeric answers are inadequate.
  • Refusal and calibration research: Advancing methods for robust uncertainty estimation, detection of ill-posedness, and reliable refusal in LLMs.
  • Expanded metrics: Beyond accuracy, integrating measures of solution diversity, explanatory depth, and process transparency.
  • Global contributor expansion: Further democratizing and diversifying mathematical content via broader recruitment.

Conclusion

SOOHAK delivers a critical resource for benchmarking research-level mathematical reasoning, with design and execution that directly confront contamination, overfitting, and narrowness of prior datasets. It exposes substantial headroom even for frontier LLMs, especially at the interface of mathematical soundness and carefulness. SOOHAK’s structure enables multidimensional evaluation—model reasoning capacity, careful refusal, and cross-subfield coverage—paving the way for robust, transparent measurement at the cutting edge of AI mathematical reasoning. Its structure, limitations, and future expansion points collectively form a template for next-generation evaluation in mathematical and other high-reasoning domains.

Explain it Like I'm 14

What this paper is about

This paper introduces SOOHAK, a new “tough math test” made by real mathematicians to see how well large language models (LLMs) can handle advanced, research-style math. It doesn’t just check if a model can solve hard problems; it also checks whether the model knows when a problem is flawed and should be refused instead of guessed at.

The big questions the paper asks

  • Can today’s AI models solve graduate-level and research-like math problems that go beyond typical contest questions?
  • Do models recognize “bad” questions (for example, missing information or contradictions) and refuse to answer instead of making things up?
  • How can we build a fair, fresh benchmark that isn’t already in the models’ training data?

How they built and tested SOOHAK

Think of SOOHAK like a carefully designed exam:

  • Two main parts:
    • Challenge: 340 new, advanced math problems written by experts (graduate level and research-adjacent).
    • Refusal: 99 tricky prompts that are ill-posed (they’re missing key info, have conflicts, or don’t have a unique answer). The “right” behavior is to explain what’s wrong and not give a fake answer.
  • Fresh, human-written questions: 64 mathematicians created the 439 Challenge+Refusal problems from scratch. Across the whole project (including an easier companion set), 105 contributors were involved.
  • Preventing “contamination”: Instead of scraping old contests or textbooks, the team commissioned original problems, used non-disclosure agreements, and checked for AI-generated submissions (banning offenders). They plan to publicly release the dataset in late 2026 to keep it from leaking into training sets too early; until then, they’ll evaluate models by request.
  • Difficulty “gates” using models: Submissions were tested by smaller and larger open models. Only problems that all these models failed made it into the Challenge set. Easier problems went to SOOHAK-Mini (a 702-question companion set ranging from olympiad level to early graduate).
  • How they scored models: For each question, each model tried three times. Two measures were used:
    • Avg@3: On average, how many of those three tries were correct?
    • Pass@3: Did the model get it right at least once in three tries?
  • Equivalence checking: Since math answers can be written in different but equivalent ways, an automated “answer judge” checked whether a model’s final answer matched the meaning of the gold answer.
  • Human baseline: Five human teams (from IMO medalists to math researchers) solved a shared set of 79 sampled problems under a time limit to help interpret how hard the benchmark really is.

What they found

Here are the most important results and what they mean:

  • Advanced problems are still very hard for top models.
    • On the Challenge set, the best scores were modest: for example, Gemini-3-Pro: about 30% Avg@3; GPT-5: about 26%; Claude-Opus-4.5: about 10%.
    • Many problems remain unsolved by any model, leaving a lot of room for improvement.
  • Open vs. closed models:
    • On the broader, mixed-difficulty SOOHAK-Mini set, strong open models did fairly well (best around mid-60% Avg@3), while top closed models reached a bit over 70%.
    • On the research-adjacent Challenge set, open models dropped more: the best open model scored under 14% Avg@3 versus around 30% for the best closed model. This suggests today’s open models struggle more with fresh, specialized math.
  • Knowing when not to answer is a new weakness:
    • On the Refusal set, no model reached 50% Avg@3. Many models still try to produce confident answers to broken questions instead of saying “this problem is ill-posed.” One open model (GLM-5) scored best here, just under 50%.
  • More compute helps with solving, not with refusing:
    • Giving models more “thinking time” or using larger models improved performance on the Challenge questions in a steady way.
    • The same scaling did not consistently improve Refusal performance. Recognizing and refusing bad questions seems to be a different skill that current training doesn’t target well.
  • Humans vs. models on a 79-question sample:
    • Combined human teams solved about half the questions. Only one model (Gemini-3-Pro) beat that combined coverage.
    • Contest-style training (like International Math Olympiad experience) helped more than narrow research specialization under the short time limit—likely because the format rewards fast, multi-step problem solving over deep, slow exploration.

Why this matters

  • A new target beyond contests: Many AI math tests are based on public contest problems that models may have seen before. SOOHAK offers fresh, expert-written, research-adjacent challenges that better reflect what it takes to push math forward.
  • Don’t just reason—know when to stop: The Refusal set highlights an essential research skill: recognizing when a question is flawed. Current models aren’t good at this yet, pointing to a clear training goal for the future.
  • Fairer, longer-lasting evaluation: By avoiding public sources and delaying release, the benchmark aims to stay clean and useful for tracking genuine progress.
  • Guidance for model builders: Results show that:
    • There’s significant headroom in advanced math reasoning.
    • Open models need better transfer to research-type problems.
    • Training for careful refusal and uncertainty handling is important.
    • Simply adding more compute won’t fix refusal behavior.

In short, SOOHAK gives the AI community a fresh, demanding, and fair way to measure the next step in math reasoning: not just solving hard problems, but also recognizing when a problem can’t be solved as written—just like a careful mathematician would.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper.

  • Benchmark access and reproducibility: The full dataset is embargoed until late 2026 and evaluations are “by request,” preventing independent replication, auditing, and community-led error discovery; a concrete plan for third‑party audits and public release of the evaluation harness is not provided.
  • Gating-induced selection bias: Challenge items are selected by failure of a specific panel of open-weight models, which may bias the distribution toward those models’ blind spots; robustness of item selection under alternative gating panels (including closed models or future checkpoints) is unstudied.
  • Coverage imbalance across mathematical domains: The dataset is heavily concentrated in Algebra & Discrete (especially number theory and combinatorics) with minimal coverage of probability, statistics, applied math/OR, and logic; how this imbalance affects generalization and score comparability across subfields is unquantified.
  • Omission of proof-centric evaluation: The format requires an explicit final-answer line, excluding proofs and multi-claim reasoning central to research mathematics; how to incorporate and reliably score proof generation and verification (e.g., with formal checkers or rubric-based grading) is left open.
  • No multimodal math: Diagrams and figures are disallowed, limiting geometry/topology and applied settings where visuals are essential; how to extend SOOHAK to robustly evaluate multimodal mathematical reasoning is unexplored.
  • Reliability of LLM-assigned subject tags: MSC subject assignment uses a GPT-5-mini classifier; there is no reported human validation, inter-annotator agreement, or error analysis to assess misclassification rates and their impact on subfield analyses.
  • Final-answer-only judging risks: The LLM judge sees only the gold and parsed answers (not the problem text), raising risks of false positives/negatives for mathematically equivalent forms, parameterized families, or unit-sensitive answers; systematic judge calibration and comparisons to symbolic/CAS checkers are missing.
  • Lack of process-faithfulness evaluation: Scoring ignores reasoning validity (only outcome matters); there is no assessment of whether correct answers arise from sound reasoning or spurious shortcuts, nor mechanisms to detect invalid chains-of-thought that coincidentally yield correct answers.
  • Refusal scoring robustness: A response is “correct” if it diagnoses a flaw, but criteria for distinguishing genuine diagnoses from generic hedges are not specified in a way that is externally reproducible; susceptibility to gaming and cross-grader reliability remain open.
  • Refusal generality and representativeness: Refusal items are drawn from rejected submissions (contradictions, missing assumptions), which may not reflect the broader ecology of ill-posedness in real research workflows; a taxonomy and balanced sampling of ill-posed categories is not established.
  • Scaling behavior of refusal: Refusal performance does not improve with model size or test-time compute, but the mechanisms (calibration, uncertainty estimation, self-critique, verifier use) remain uninvestigated; actionable training/evaluation protocols for selective prediction are not proposed.
  • Human baseline limitations: Results use 79 prompts, a 4.5-hour time budget, non-standardized test conditions, and small teams with variable tool use; confidence intervals, task-level variance, and replication with larger, more controlled human cohorts are missing.
  • Contest vs research skill mismatch: Human results suggest the format rewards contest-style “short-path” solutions under time pressure more than research-style exploration; how to redesign items, timing, and scoring to better reflect research workflows (e.g., literature use, iteration, partial results) is unresolved.
  • Verification of “research-adjacent” correctness: Some solutions reportedly rely on folklore or unpublished strengthening; procedures for independent expert validation, versioning of solutions, and post-release errata resolution are not specified.
  • Contamination and dedup checks: Although items are “newly authored,” ScienceBench-sourced problems and translated items may inadvertently overlap with pretraining data; there is no documented large-scale deduplication against public corpora or data-provenance audit.
  • Translation and cross-lingual validity: Items are bilingual (English/Korean) via MT + post-editing, but translation fidelity, terminology normalization, and cross-lingual equivalence are not quantitatively validated; model performance parity across languages is unreported.
  • Compute fairness and comparability: Models were tested with heterogeneous reasoning-effort and context-length settings (e.g., some at 81,920 tokens, others at 16,384), complicating fair comparisons; a standardized compute budget and ablation on decoding temperatures are lacking.
  • Tool-augmented and retrieval-augmented evaluation: Models were evaluated without external tools (e.g., CAS, theorem provers, literature retrieval), yet research practice is tool-heavy; how evaluation changes under tool use, and how to fairly provision tools across models, remains open.
  • Long-horizon, agentic workflows: The benchmark is single-turn with short time horizons; evaluating multi-episode exploration, hypothesis testing, and literature synthesis—core to research—remains unexplored.
  • Robustness to prompt phrasing: Sensitivity of performance to paraphrases, variable renaming, or minor LaTeX/format perturbations is not studied; no adversarial or invariance testing is reported.
  • Score uncertainty and significance: No confidence intervals, bootstrap analyses, or significance tests are provided for Avg@3/Pass@3 differences; the reliability of model rankings given n=340 (Challenge) and n=99 (Refusal) is unquantified.
  • Difficulty calibration and rubrics: The paper notes noisy difficulty labels but does not implement structured rubrics, multi-rater difficulty scoring, or equating across subfields; methods for difficulty anchoring and item-response modeling are not deployed.
  • Answer parsing edge cases: Parsing correctness for expressions, parameterized solutions, multiple valid forms, or modular equivalences is not audited at scale; policies for ambiguous answers are incomplete.
  • Impact of LLM-based screening on item style: Using LLMs to gate difficulty might steer contributors toward adversarial patterns the panel fails at, narrowing diversity; measuring the stylistic shift introduced by gating is not attempted.
  • Refusal subset size and diversity: With only 99 items, statistical power for fine-grained analyses (e.g., by flaw type) is limited; guidance on scaling the refusal suite to cover broader ill-posedness classes is absent.
  • Predictive validity to real research: It is unknown whether SOOHAK scores correlate with success in authentic research tasks (e.g., conjecture generation, lemma discovery); no downstream or longitudinal validation studies are provided.
  • Maintenance and longevity: There is no public plan for refreshing items to prevent saturation, handling post-release leakage, or curating a “living” benchmark while preserving comparability over time.
  • Transparency of ScienceBench sourcing: The provenance, deduplication, and validation of bulk-purchased items are not detailed sufficiently for external auditing.
  • Ethical/IP considerations: Contributor NDAs and full IP transfer limit reuse and derivative work pre-release; how this impacts open science, community contributions, and post-release governance is not addressed.
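The score-uncertainty item in the list above can be made concrete with a rough interval estimate. The sketch below applies the standard Wilson score interval to the reported Gemini-3-Pro Challenge score (n=340) and the GLM-5 Refusal score (n=99). It treats each question as one independent Bernoulli trial, which is a simplifying assumption: Avg@3 averages three samples per question, so these intervals are only indicative, and the paper itself reports none.

```python
import math
from typing import Tuple


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed over n trials.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Reported scores from the summary; subset sizes are 340 and 99.
for label, p_hat, n in [("Challenge, Gemini-3-Pro", 0.3039, 340),
                        ("Refusal, GLM-5", 0.4949, 99)]:
    lo, hi = wilson_interval(p_hat, n)
    print(f"{label}: [{lo:.3f}, {hi:.3f}]")
```

The interval for the 99-item Refusal set is noticeably wider than for the 340-item Challenge set, which is the statistical-power concern raised for fine-grained refusal analyses.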

Practical Applications

Summary

The paper introduces SOOHAK, a mathematician-authored benchmark designed to evaluate graduate-level and research-adjacent mathematical reasoning in LLMs. It comprises (i) a Challenge subset (hard, research-adjacent problems), (ii) a Refusal subset (detecting ill-posed problems and abstaining), and (iii) a companion SOOHAK-Mini (olympiad to early graduate). Beyond dataset content, the work contributes methods for contamination-resistant data collection (LLM-gated difficulty routing, multi-stage review, contributor integrity checks), bilingual translation with LaTeX preservation, LLM-based answer judging for mathematical equivalence, and insights into scaling and human baselines. These yield several practical applications across industry, academia, policy, and products.

Below are actionable applications grouped into immediate (deployable now) and long-term (requiring further development) categories.

Immediate Applications

The items below can be implemented now, using the paper’s methods, the evaluation-by-request service, or by replicating the collection pipeline. Dependencies and assumptions are noted inline.

  • Advanced model evaluation and product readiness checks
    • Sector: Software/AI labs; R&D; model governance.
    • Use Case: Incorporate SOOHAK Challenge and Refusal scores into internal evaluation gates before releasing math-heavy or safety-critical features (e.g., scientific copilots, quantitative analysis bots).
    • Tools/Workflows: “Research-Readiness Index” combining Avg@3/Pass@3 with a Refusal score; best-of-n sampling harness for Pass@3; compute-scaling sweeps to estimate returns on test-time compute.
    • Assumptions/Dependencies: Dataset is embargoed until late 2026; interim evaluations are available upon request; requires budget and access to frontier models and evaluation APIs.
  • Refusal-aware safety and hallucination control
    • Sector: Healthcare, legal, finance, engineering software, autonomous agents.
    • Use Case: Add an abstention/refusal classifier trained and validated with SOOHAK Refusal-like prompts to reduce confidently wrong outputs in high-stakes workflows (e.g., decline to compute when assumptions are underspecified).
    • Tools/Workflows: “Refusal Head” or policy layer; refusal-centric reward models (RLAIF) and unit tests for “don’t-know” behavior; deployment KPIs that include Refusal pass rates.
    • Assumptions/Dependencies: Portability from math to target domain depends on curating domain-specific refusal sets; careful calibration to avoid over-refusal.
  • Contamination-resistant benchmark building for other domains
    • Sector: Healthcare, law, coding, physics, education assessments.
    • Use Case: Replicate the LLM-gated, multi-reviewer pipeline (originality attestations, NDA/IP, AI-use detection, multi-model failure gates) to build fresh, leakage-resistant benchmarks in new fields.
    • Tools/Workflows: “LLM-Gated Benchmark Builder” (submission portal, gate models, similarity screens); contributor contracts and integrity monitoring; two-reviewer audit process.
    • Assumptions/Dependencies: Requires expert contributor network, budget, and governance; clear IP and confidentiality processes.
  • Bilingual scientific content production with LaTeX preservation
    • Sector: Education, publishing, EdTech, academic societies.
    • Use Case: Adopt the paper’s LaTeX-preserving MT + professional post-editing pipeline to produce bilingual problem sets, textbooks, and assessments without breaking formulas.
    • Tools/Workflows: Placeholder-preserving MT, terminology glossaries, independent QA pass.
    • Assumptions/Dependencies: Access to bilingual editors and domain reviewers; MT quality varies by language pair.
  • Automated math answer judging and grading at scale
    • Sector: EdTech, online competitions, MOOCs, homework systems, computational notebooks.
    • Use Case: Use an LLM judge to compare final answers for mathematical equivalence (e.g., simplified forms, parameterizations), reducing human grading load.
    • Tools/Workflows: “Math Answer Equivalence Judge” with guardrails (unit tests, adversarial checks, uncertainty thresholds); fallbacks to CAS/formal checkers for critical items.
    • Assumptions/Dependencies: LLM judges can misclassify edge cases; requires monitoring and periodic calibration.
  • Model selection guidance for quantitative products
    • Sector: Finance, engineering, scientific computing, academic tooling.
    • Use Case: Choose model families based on observed strengths—e.g., closed models for research-level math; GLM-5-like systems for refusal behavior—until open models close the gap.
    • Tools/Workflows: Internal bake-offs on SOOHAK-like sets; ensemble strategies (solver + refusal gate).
    • Assumptions/Dependencies: Performance is benchmark-specific; re-validate on in-domain tasks.
  • Human-in-the-loop team design and operations
    • Sector: Research management, competitive problem solving, consulting.
    • Use Case: Apply insights that contest-style training and tool use yield higher throughput under time pressure; organize mixed teams to maximize coverage (parallelism + cross-checking).
    • Tools/Workflows: Playbooks for division of labor, shared-room collaboration, computational verification protocols.
    • Assumptions/Dependencies: Transferability depends on domain; not all research tasks reward short-path strategies.
  • Procurement and competition-grade evaluation protocols
    • Sector: Government, defense, national AI initiatives, corporate MLOps governance.
    • Use Case: Adopt sealed, access-controlled benchmarks and evaluation-by-request services to maintain fairness and reduce leakage in model competitions and vendor selection.
    • Tools/Workflows: Embargoed datasets, third-party administrators, audit trails, periodic scorecards.
    • Assumptions/Dependencies: Requires trust in evaluator neutrality; legal and contractual scaffolding.
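For the automated answer-judging item in the list above, a deterministic pre-check can settle the easy cases before any LLM judge is consulted. The sketch below is one possible guardrail, not the paper's judge: the `exact_match` function is hypothetical, and it only handles answers that Python's `fractions.Fraction` can parse (integers, decimals, and simple ratios). Anything else returns `None`, meaning "defer to the LLM judge or a CAS".

```python
from fractions import Fraction
from typing import Optional


def exact_match(candidate: str, gold: str) -> Optional[bool]:
    """Compare two final answers exactly when both parse as rational numbers.

    Returns True or False when a deterministic verdict is possible, and
    None when either string is not a plain number (defer to another judge).
    """
    try:
        return Fraction(candidate) == Fraction(gold)
    except (ValueError, ZeroDivisionError):
        return None


print(exact_match("42", "42.0"))   # True
print(exact_match("2/4", "0.5"))   # True
print(exact_match("41", "42"))     # False
print(exact_match("x + 1", "42"))  # None
```

Keeping the deterministic path separate means the LLM judge is only asked about answers that genuinely need semantic comparison, which narrows where misclassification can occur.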

Long-Term Applications

These applications require additional research, scaling of models, dataset release, or broader ecosystem development.

  • Certification standards for reasoning and abstention
    • Sector: Policy/regulation, compliance, model governance.
    • Use Case: Establish formal certifications requiring minimum performance on research-level reasoning and refusal (abstention) before deploying models in safety-critical domains.
    • Tools/Workflows: Standards bodies define test suites, versioned leaderboards, audit protocols.
    • Assumptions/Dependencies: Consensus on metrics; public datasets (post-embargo) and third-party auditors.
  • Training paradigms for robust abstention and problem checking
    • Sector: AI research, safety, agent systems.
    • Use Case: Develop training objectives that reward diagnosing ill-posedness, missing assumptions, and contradictions; integrate self-checkers and verifier loops into agents.
    • Tools/Workflows: Refusal-focused RLHF/RLAIF; synthetic generation of ill-posed variants; verifier-critic architectures; uncertainty calibration.
    • Assumptions/Dependencies: Need scalable datasets across domains; avoid over-conservative agents.
  • Semi-autonomous mathematical discovery assistants
    • Sector: Academia (mathematics, theoretical CS, physics), R&D labs.
    • Use Case: Build agents that propose conjectures, search literature, synthesize “folklore-level” reasoning, and check well-posedness—guided by benchmarks like SOOHAK as progress bars.
    • Tools/Workflows: Retrieval + tool-use + proof-checking; collaboration with human mathematicians; long-context reasoning.
    • Assumptions/Dependencies: Frontier models must exceed current Challenge performance; reliable formal verification.
  • Cross-domain “SOOHAK-like” benchmarks (law, medicine, engineering)
    • Sector: Healthcare, legal tech, engineering, energy.
    • Use Case: Create domain-specific research-level benchmarks with refusal splits (e.g., incomplete clinical orders, ill-posed legal queries) to improve reliability of domain copilots.
    • Tools/Workflows: Expert-authored items, LLM gating, reviewer audits, bilingual release for global use.
    • Assumptions/Dependencies: Recruiting experts at scale; funding and governance akin to SOOHAK.
  • Next-generation educational platforms and assessments
    • Sector: Education/EdTech.
    • Use Case: Post-release, integrate SOOHAK and SOOHAK-Mini into adaptive learning systems, olympiad training, and graduate curricula, with auto-grading by equivalence judges and hints generated from guardrailed chains of thought.
    • Tools/Workflows: Curriculum-aligned tagging (MSC), progressive difficulty ladders, explanation verification.
    • Assumptions/Dependencies: Public availability; validated grading reliability and bias controls.
  • Benchmark longevity and anti-contamination ecosystems
    • Sector: AI evaluation infrastructure.
    • Use Case: Maintain “living” research benchmarks with rotating embargoed pools, access controls, and contamination monitors; standardized APIs for evaluation-by-request.
    • Tools/Workflows: Data escrow, leakage scanners, provenance tracking, contributor reputation systems.
    • Assumptions/Dependencies: Sustainable funding; community norms on access and reporting.
  • Agent team orchestration informed by human-team findings
    • Sector: Software agents, collaborative AI.
    • Use Case: Construct multi-agent systems with complementary profiles (fast solvers, verifiers, tool specialists), mirroring human teams’ division of labor and cross-checking.
    • Tools/Workflows: Role-specialized agents, adjudication layers, resource allocation policies under token/time budgets.
    • Assumptions/Dependencies: Reliable routing and verification; cost–latency trade-offs.
  • Corporate risk metrics and dashboards anchored in refusal performance
    • Sector: Enterprise AI governance, finance, healthcare providers.
    • Use Case: Track “refusal adequacy” and “over-refusal” as governance KPIs across business units deploying AI; tie to incident reporting and red-teaming.
    • Tools/Workflows: Continuous evaluation pipelines, domain-specific refusal canaries, drift monitoring.
    • Assumptions/Dependencies: Alignment with regulatory expectations; data privacy in evaluation logs.
  • IP and contributor frameworks for high-quality expert data
    • Sector: Publishing, data marketplaces, foundation model training.
    • Use Case: Scale the NDA/IP-transfer and integrity pipeline to source difficult, original training/evaluation data ethically and securely.
    • Tools/Workflows: Contract templates, anti-AI-submission detection, compensation models that reward quality beyond raw difficulty.
    • Assumptions/Dependencies: Legal compliance across jurisdictions; fair compensation norms.
  • Formal-methods-integrated graders and verifiers
    • Sector: Software verification, math/engineering computation tools.
    • Use Case: Hybrid LLM judge + formal checker systems for grading and proof validation, reducing reliance on LLM judgment alone.
    • Tools/Workflows: CAS/prover integration, counterexample search, proof object generation.
    • Assumptions/Dependencies: Coverage gaps in formal tools; computational cost and usability.
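
The hybrid grader idea in the last item can be illustrated with a minimal sketch: run a cheap deterministic check first and fall back to a judge only when that check cannot decide. Everything here is an assumption made for illustration, not the paper's grading pipeline: the function names `exact_check`, `grade`, and `string_judge` are hypothetical, the deterministic check handles only answers that parse as exact rationals, and the judge is a plain string comparison standing in for an LLM call.

```python
from fractions import Fraction
from typing import Callable, Optional


def exact_check(predicted: str, gold: str) -> Optional[bool]:
    """Deterministic pre-check run before any judge.

    Returns True or False when both answers parse as exact rationals,
    and None when either one does not, meaning the pair should be
    deferred to a judge or a formal checker.
    """
    try:
        return Fraction(predicted.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return None


def grade(predicted: str, gold: str, judge: Callable[[str, str], bool]) -> bool:
    """Use the exact check when it is decisive, otherwise call `judge`."""
    verdict = exact_check(predicted, gold)
    return judge(predicted, gold) if verdict is None else verdict


# Hypothetical stand-in for an LLM judge: plain string comparison.
def string_judge(predicted: str, gold: str) -> bool:
    return predicted.strip() == gold.strip()


print(grade("2/4", "0.5", string_judge))        # decided by the exact check
print(grade("\\pi/2", "\\pi/2", string_judge))  # deferred to the judge
```

The design choice is that the deterministic layer never guesses: it returns `None` when it cannot parse an answer, so the less reliable judge is consulted only for the cases that formal tools do not cover.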

These applications build on SOOHAK’s core contributions: a hard, contamination-resistant benchmark for research-level math; an explicit refusal evaluation target; a scalable, integrity-focused data pipeline; bilingual LaTeX-preserving production; and evaluation methodologies (multi-sample scoring, LLM judging, compute scaling). Many of them depend on the public dataset release and on further model progress, and become more feasible as both arrive.
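
The “refusal adequacy” and “over-refusal” KPIs named in the governance item can be made concrete with a short sketch. The definitions used here are assumptions, since the summary does not define them: refusal adequacy is taken as the share of ill-posed items on which the model refused, and over-refusal as the share of well-posed items on which it refused anyway. The `Outcome` record and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Outcome:
    ill_posed: bool  # True if the item is deliberately ill-posed
    refused: bool    # True if the model declined to give an answer


def refusal_kpis(outcomes: List[Outcome]) -> Dict[str, Optional[float]]:
    """Compute both rates; a rate is None when its denominator is empty."""
    ill = [o for o in outcomes if o.ill_posed]
    well = [o for o in outcomes if not o.ill_posed]
    adequacy = sum(o.refused for o in ill) / len(ill) if ill else None
    over = sum(o.refused for o in well) / len(well) if well else None
    return {"refusal_adequacy": adequacy, "over_refusal": over}


# Hypothetical evaluation log: 2 ill-posed items, 4 well-posed items.
log = [
    Outcome(ill_posed=True, refused=True),
    Outcome(ill_posed=True, refused=False),
    Outcome(ill_posed=False, refused=False),
    Outcome(ill_posed=False, refused=True),
    Outcome(ill_posed=False, refused=False),
    Outcome(ill_posed=False, refused=False),
]
print(refusal_kpis(log))
```

Tracking the two rates together matters because either one can be pushed to its ideal value trivially: a model that always refuses scores perfectly on adequacy and worst on over-refusal.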

Glossary

  • Abelian variety: A complete, connected algebraic variety with a compatible group structure; over the complex numbers, a complex torus that admits a projective embedding. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Access control: Restricting access to benchmark items to reduce data leakage or contamination. "the most recent generation of benchmarks responds to leakage by withholding problems behind access control [24, 25, 22, 1]"
  • Automorphism: A structure-preserving bijection from a mathematical object to itself. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Avg@3: An evaluation metric averaging correctness over three sampled responses per question. "Gemini-3-Pro leads Challenge with Avg@3 of 30.39, followed by GPT-5 at 26.37."
  • Barratt-Eccles operad: A classical E∞ operad built from symmetric groups, modeling homotopy-coherent commutative structures. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Brieskorn homology 3-sphere: A 3-manifold with the same homology as the 3-sphere, defined as a link of an isolated complex surface singularity. "Let Σ(3, 5, 49) be the Brieskorn homology 3-sphere"
  • Contractible: Describes a space that is homotopy equivalent to a point, so that it can be continuously shrunk to a point. "showing that the homology 4-ball is contractible and admits a Mazur-type handle structure."
  • Contamination: Undesired overlap between benchmark items and model training data, inflating performance. "While human-authoring fresh problems sidesteps contamination, such efforts are typically confined to a single mathematical area"
  • Fano variety: An algebraic variety with ample anticanonical bundle; central in classification theory. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Folklore-level reasoning: Reasoning relying on widely known but unpublished mathematical facts and heuristics among experts. "what they termed folklore-level reasoning."
  • Frontier model: A cutting-edge, top-performing LLM at the current capability frontier. "frontier models including Gemini-3-Pro, GPT-5, and Claude-Opus-4.5 reach 30.4%, 26.4%, and 10.4% respectively"
  • Hallucination: Model behavior where confident but unsupported or incorrect content is produced. "what governs refusal and hallucination behavior we leave to future work."
  • Homotopical algebra: The use of homotopy-theoretic methods to study algebraic structures and categories. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Integral homology 4-ball: A 4-manifold whose integral homology is the same as that of the standard 4-ball. "Σ(3, 5, 49) bounds an integral homology 4-ball, careful use of Kirby calculus may strengthen the construction"
  • Kazhdan-Lusztig polynomials: Polynomials associated with Coxeter groups and Hecke algebras, encoding deep representation-theoretic information. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Kirby calculus: A set of moves on framed links used to study 3- and 4-manifolds via handle decompositions. "careful use of Kirby calculus may strengthen the construction"
  • Kirby diagram: A link diagram encoding a handle decomposition of a 4-manifold, used in Kirby calculus. "solving the problem likely requires effective Kirby-diagram manipulation."
  • Leakage: Exposure of benchmark items (or their answers) in training or public sources, compromising evaluation integrity. "the most recent generation of benchmarks responds to leakage by withholding problems behind access control"
  • LLM judge: An LLM used to evaluate equivalence or correctness of answers from other models. "we use GPT-5-Mini as an LLM judge that compares the parsed answer to the gold answer via mathematical equivalence."
  • Mathematics Subject Classification (MSC): A standardized taxonomy for categorizing mathematical literature and topics. "we assign each question to a Mathematics Subject Classification (MSC) subject area"
  • Mazur-type handle structure: A handle decomposition consisting of a single 0-handle, 1-handle, and 2-handle, as in Mazur's contractible 4-manifolds. "admits a Mazur-type handle structure."
  • Moduli space: A parameter space classifying objects (up to isomorphism) of a given type, often with geometric structure. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Open-weight model: A model whose weights are publicly available for download and fine-tuning. "leading open-weight models such as Qwen3-235B, GPT-OSS-120B, and Kimi-2.5 remain below 15%."
  • Pass@3: The probability that at least one of three sampled responses is correct. "Pass@3 across the Qwen3 family (0.6B to 32B) on Challenge (blue) and Refusal (orange);"
  • Refusal subset: A test split designed to assess whether a model detects ill-posed prompts and refrains from answering. "introduces a refusal subset that probes a capability intrinsic to research mathematics"
  • Richardson varieties: Intersections of Schubert and opposite Schubert varieties within flag varieties, important in geometric representation theory. "including tags such as automorphism, abelian variety, Fano variety, Kazhdan-Lusztig polynomials, moduli space, Richardson varieties, Barratt-Eccles operad, and homotopical algebra."
  • Smooth embedding: A smooth injective immersion between manifolds that is a homeomorphism onto its image. "every closed orientable 3-manifold smoothly embeds in #"-1(S² × S²)"
  • Sovereign AI Foundation Model: A national initiative to develop and evaluate domestic AI foundation models. "The full collection was developed for the South Korean Ministry of Science and ICT MSIT 'Sovereign AI Foundation Model' project"
  • Test-time compute: The computational budget (e.g., tokens, reasoning steps) allocated when generating model outputs. "Challenge scales roughly linearly with both train- and test-time compute; Refusal does not."
  • Train-time compute: The computational resources spent during model training (e.g., parameter count, training steps, FLOPs). "Challenge scales roughly linearly with both train- and test-time compute; Refusal does not."
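
The Avg@3 and Pass@3 entries above can be illustrated with a small sketch. It assumes the simplest empirical reading of the two glossary definitions: each question has three graded samples, Avg@3 is the mean correctness over all samples, and Pass@3 is the fraction of questions with at least one correct sample. The function names and the graded outcomes are hypothetical, and the paper's exact estimator may differ.

```python
from typing import List


def avg_at_k(results: List[List[bool]]) -> float:
    """Mean per-question accuracy, averaged over all questions."""
    per_question = [sum(samples) / len(samples) for samples in results]
    return sum(per_question) / len(per_question)


def pass_at_k(results: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct sample."""
    return sum(1 for samples in results if any(samples)) / len(results)


# Hypothetical graded outcomes: 4 questions, 3 samples each.
graded = [
    [True, False, False],
    [True, True, True],
    [False, False, False],
    [False, True, False],
]
print(f"Avg@3  = {avg_at_k(graded):.4f}")
print(f"Pass@3 = {pass_at_k(graded):.4f}")
```

On the same outcomes Pass@3 is never lower than Avg@3, since a question counts fully toward Pass@3 as soon as one of its samples is correct.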
