
Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

Published 29 Jan 2026 in cs.AI, math.CO, and math.NT | (2601.22401v3)

Abstract: We present a case study in semi-autonomous mathematics discovery, using Gemini to systematically evaluate 700 conjectures labeled 'Open' in Bloom's Erdős Problems database. We employ a hybrid methodology: AI-driven natural language verification to narrow the search space, followed by human expert evaluation to gauge correctness and novelty. We address 13 problems that were marked 'Open' in the database: 5 through seemingly novel autonomous solutions, and 8 through identification of previous solutions in the existing literature. Our findings suggest that the 'Open' status of the problems was through obscurity rather than difficulty. We also identify and discuss issues arising in applying AI to math conjectures at scale, highlighting the difficulty of literature identification and the risk of "subconscious plagiarism" by AI. We reflect on the takeaways from AI-assisted efforts on the Erdős Problems.

Summary

  • The paper introduces a semi-autonomous system using Gemini to generate and filter candidate solutions for 700 Erdős problems.
  • It employs a multi-layered verification process combining AI filtering and expert audits to isolate novel and correct outcomes.
  • The study highlights issues in attribution and literature consistency, emphasizing the necessity of human oversight in mathematical discovery.

Semi-Autonomous Mathematics Discovery with Gemini: A Case Study on the Erdős Problems

Introduction and Motivation

The proliferation of unresolved mathematical conjectures, particularly those left by Erdős, presents both an opportunity and a challenge for the application of LLMs in mathematical discovery. The study under discussion deploys an AI system built upon Gemini ("Aletheia") to systematically investigate 700 "Open" problems in Bloom's Erdős Problems database (2601.22401). Rather than focusing on "solving" famous long-standing conjectures, the work scrutinizes the operational interface between AI-generated mathematical reasoning, verification bottlenecks, literature obfuscation, and the sociotechnical context of mathematical research.

The motivation is twofold: first, to empirically evaluate the extent to which a state-of-the-art math research agent can autonomously resolve a large batch of open conjectures; second, to systematically analyze the ensuing human–AI interplay required to adjudicate correctness, novelty, and scholarly attribution in the face of large-scale, potentially noisy model output.

Figure 1: End-to-end pipeline for the semi-autonomous math discovery process, using an LLM with a natural language verifier to narrow 700 open problems to a carefully human-evaluated shortlist.

Methodology: AI Integration and Human Verification

The study adopts a hierarchical triage protocol, using Aletheia to generate candidate solutions in natural language, which are then filtered by an internal LLM verifier. This significantly reduces the number of candidate outputs subjected to human expert review, illustrating the practical necessity of automated preliminary plausibility screening. The process unfolds as follows (a minimal sketch of the funnel appears after the list):

  • Generation: Gemini/Aletheia is prompted with each of the 700 "Open" problems, producing putative solutions in natural language.
  • AI Verification: An integrated LLM-based verifier pre-filters solutions, returning 212 potentially correct responses from the original set.
  • Non-specialist Filtration: A secondary, less-expert human filter further reduces the set based on explicit errors or trivial misinterpretations, yielding 27 candidates.
  • Expert Audit: Domain experts (sometimes consulting externally) adjudicate correctness, novelty, and intent, ultimately refining the yield to 13 meaningfully correct resolutions, of which only 5 are arguably novel and autonomous.
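
A minimal sketch of this triage funnel is shown below; the function names, thresholds, and data fields are illustrative stand-ins, since the paper does not describe Aletheia's internal interfaces:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    problem_id: str
    solution: str
    verifier_score: float  # confidence reported by the natural-language verifier

# Placeholder stages: these functions model the shape of the paper's funnel
# (700 problems -> 212 verifier-approved -> 27 human-shortlisted -> 13 accepted),
# not its actual implementation.
def generate(problem_id: str) -> Candidate:
    return Candidate(problem_id, f"candidate argument for {problem_id}", verifier_score=0.5)

def ai_verifier_pass(c: Candidate, threshold: float = 0.7) -> bool:
    return c.verifier_score >= threshold          # AI verification stage

def nonspecialist_pass(c: Candidate) -> bool:
    return "obvious error" not in c.solution      # non-specialist filtration stage

def expert_verdict(c: Candidate) -> str:
    # One of: "autonomous", "partial", "rediscovery", "literature", "rejected".
    return "rejected"

def triage(problem_ids: list[str]) -> dict[str, list[Candidate]]:
    candidates = [generate(p) for p in problem_ids]
    plausible = [c for c in candidates if ai_verifier_pass(c)]
    shortlist = [c for c in plausible if nonspecialist_pass(c)]
    outcomes: dict[str, list[Candidate]] = {}
    for c in shortlist:
        outcomes.setdefault(expert_verdict(c), []).append(c)
    return outcomes
```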

Notably, strong discipline is exercised against misleading attributions or uncritical claims of "AI proof," with rigorous emphasis placed on literature cross-examination to mitigate accidental "AI plagiarism," that is, LLMs restating unpublished or obscure arguments absorbed during training without proper citation.

Results: Taxonomy and Analysis

The 13 positive outcomes fall into four operational categories:

  • Autonomous Resolution: 2 problems solved via substantive, seemingly new argumentation by Aletheia.
  • Partial Solution: 3 problems where only a substatement or one part of a multipart problem was addressed.
  • Independent Rediscovery: 3 problems where the solution was subsequently traced to pre-existing literature, indicating the model essentially "rediscovered" a non-digitally codified result.
  • Literature Identification: 5 problems where Aletheia (correctly) identified that the problem was misclassified as open due to obscurity of existing solutions, not real mathematical difficulty.

A critical quantitative finding is that only about 6.5% of the scrutinized candidates yielded meaningfully correct new results; while 31.5% were technically correct, most were either duplicates or stemmed from misinterpretation of the problem's intent, particularly due to ambiguous or corrupt statements in the database.
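
These percentages are consistent with the absolute counts reported later in this summary (137 flawed, 63 technically correct, 13 meaningfully correct, out of 200 clearly judged candidates); a quick arithmetic check, with those counts taken as given:

```python
# Reconcile the reported counts with the headline percentages.
judged = 200                 # clearly judged candidates (reported later in the summary)
flawed = 137
technically_correct = 63     # includes the meaningfully correct subset
meaningfully_correct = 13

assert flawed + technically_correct == judged
print(f"flawed:               {flawed / judged:.1%}")                 # 68.5%
print(f"technically correct:  {technically_correct / judged:.1%}")   # 31.5%
print(f"meaningfully correct: {meaningfully_correct / judged:.1%}")  # 6.5%
```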

Notable Thematic Outcomes

  • Obscurity vs. Difficulty: The study suggests that many "open" problems in Erdős-type lists are artifacts of bibliographical drift or transcription errors rather than genuine mathematical difficulty.
  • Bottleneck Shift: The human bottleneck migrates from direct solution verification to literature archaeology and precise interpretation of legacy notation and intent.
  • Subconscious Plagiarism Risk: LLMs are susceptible to regurgitating proof ideas seen in pretraining, raising acute issues of attribution and scientific credit, especially for results that closely extrapolate the existing literature.

Detailed Example: Autonomous Solution and Error Typology

One of the more substantive autonomous solutions (Erdős-1051) addresses the irrationality of a certain series for lacunary integer sequences. The proof leverages a diagonal argument reminiscent of Mahler's method and has been formalized and generalized in follow-up work. While the overall argument is sound, minor misapplications of quantitative bounds and improper handling of constants (stemming from model hallucinations or shallow citation heuristics) required manual correction by the authors before the result could be professionally communicated.

Other highlighted cases illustrate how easily structural database ambiguities (e.g., notational variance in "convolution" problems or ostensible generalizations of known results) can spuriously inflate apparent AI performance. In several instances, Aletheia's "solution" was merely an immediate reduction to a classic or overlooked reference, further reinforcing the necessity of domain-aware filtering and contextualization.

Implications for Mathematical AI Systems

This study suggests that LLM-based math agents, while reaching a threshold of operational utility, are still deeply reliant on tightly coupled human adjudication—particularly for high-precision research tasks where novelty and reliable attribution are at a premium. Practical challenges include:

  • Automated Literature Integration: Current LLMs are limited in their ability to dynamically search the literature and attribute prior results, leading either to redundancy or to unintentional paraphrase of established but non-digitally indexed results.
  • Problem Formulation Ambiguity: Many benchmarks in mathematical AI suffer from poorly policed or ambiguous statements, necessitating close collaboration with domain experts to crystallize precise (formalizable) semantics.
  • Community Coordination: Efforts to crowdsource or "wiki-fy" updates to problem repositories (as illustrated by Terence Tao's involvement following the AI-assisted progress) are critical for accurate communal tracking of genuine advances.

Prospective Developments in AI-Augmented Mathematics

Looking forward, the study underscores the importance of hybrid architectures that blend scalable, automated LLM reasoning with high-precision tooling for literature-graph exploration, interactive formal verification (where appropriate), and automated provenance tracking. Future work may be profitably directed towards:

  • Better AI Transparency: Developing systems that reliably signal when a proof approach, construction, or lemma is directly recycled from training data rather than genuinely synthesized.
  • Meta-Verification Workflows: Building frameworks that not only check local correctness, but explicitly map candidate solutions onto the lattice of extant results, rationalizing "novelty" and preventing the inflation of AI research claims.
  • Refined Evaluation Protocols: Updating benchmark standards for open problem lists to account for the (often considerable) lag in propagation of solved status, and embedding AI in problem curation as well as solution attempts.

Conclusion

This case study on semi-autonomous mathematics discovery via Gemini demonstrates clear proof-of-concept capability for LLM-based research agents in harvesting "low-hanging fruit" among classic open problems. The process does not trivialize deep mathematics, but refocuses the labor bottleneck from brute-force proof search to nuanced, context-sensitive expert assessment. Most strikingly, the observed challenges in literature identification, interpretation of legacy statements, and proper attribution foreground the present limitations of automated mathematical discovery and argue for conservative claims regarding near-term autonomy in AI mathematical research.

The integration of LLMs into mathematical workflow should thus be seen as an accelerator of expert time—particularly in attention-starved subfields—rather than as a stand-alone agent of discovery. The responsible deployment of such systems requires transparent workflows, robust auditing (to prevent subconscious plagiarism and attribution lapses), and close cooperation with active research communities.

Explain it Like I'm 14

What this paper is about

This paper tells the story of using an AI system to help solve and check famous math questions called Erdős problems. The team built a math “research agent” on top of Google’s Gemini called Aletheia. They asked it to look at hundreds of problems listed as “Open” (unsolved) on a public website and to propose solutions. Then human mathematicians carefully checked which AI answers were actually correct and actually new.

In short: the paper is a case study of how AI and people can work together to search for math discoveries—and what goes right and wrong along the way.

What questions the researchers asked

The team focused on simple, practical questions:

  • Can an AI help scan a large list of “open” math problems and find real progress quickly?
  • If the AI suggests answers, how many are truly correct and genuinely new?
  • What are the biggest bottlenecks: making the AI reason correctly, or checking if an answer already exists in old papers?
  • What problems show up when using AI at scale for math (like misreading a problem or unintentionally copying old work)?

How they did it (methods, in everyday terms)

Think of the process like searching for treasure on a big beach:

  1. AI as a metal detector:
    • The AI agent (Aletheia) read 700 “Open” Erdős problems and tried to produce solutions.
    • It also used a built-in, plain-language “verifier” to judge whether its own answers might be right. This helped filter out obviously weak guesses so humans wouldn’t have to check everything.
  2. Human experts as the treasure inspectors:
    • Out of 700 problems, 212 AI answers looked promising.
    • Human mathematicians reviewed these to remove wrong ones, ending with 27 top candidates.
    • Domain experts then checked each carefully for two things: was the math correct, and was the result actually new (not already published)?
  3. Why not use only formal proof checkers (like Lean)?
    • Very little of advanced math is written in those systems yet, so the AI would have too few tools.
    • Many problem statements on the website were imperfect or ambiguous—humans had to interpret what the original mathematician likely meant.

Analogy: The AI is good at quickly finding “interesting lumps of metal,” but humans must still decide if a find is gold, a known coin, or just a bottle cap.

What they found (key results)

  • Out of 700 “Open” problems:
    • 212 AI responses looked possibly correct.
    • After a first pass by humans: 27 remained.
    • Careful expert checking found 63 technically correct solutions overall, but only 13 that truly solved the intended problems in a meaningful way.
    • In percentages (for 200 clearly judged candidates): about 68.5% were flawed, 31.5% technically correct, and 6.5% meaningfully correct.
  • Those 13 “meaningfully correct” split into four types:
    • Autonomous Resolution: the AI produced what seems to be the first correct solution (2 problems).
    • Partial AI Solution: the AI correctly solved part of a multi-part problem (3 problems).
    • Independent Rediscovery: the AI found a correct solution that humans later found already existed in the literature (3 problems).
    • Literature Identification: the AI discovered the problem was already solved in past papers, even though it was marked “Open” (5 problems).
  • A surprising theme: many “Open” problems were actually already solved or poorly worded. Some were marked open mainly because the solution was hard to find in the literature, not because the problem was truly difficult.
  • Another notable result: the AI solved one problem (Erdős-1051) in a way the authors think is a modest but genuine autonomous advance. This later led to a broader generalization in further work.
  • Biggest bottleneck: checking if a solution was new. It took more time to search the old literature and compare ideas than to check the math itself. The team also worried about “subconscious plagiarism,” where the AI might re-create something it “saw” during training without citing it.
  • Mismatch problems: around 50 AI solutions were technically correct for the literal wording but didn’t match what Erdős originally intended (for example, because of missing conditions or ambiguous definitions). These are mathematically pointless even if technically true.

Why this matters: The AI is good at quickly surfacing possible answers, but human care is essential to confirm correctness, intended meaning, and novelty.

Why it’s important (implications and impact)

  • The good news:
    • AI can help with “attention bottlenecks”—scanning many problems, suggesting ideas, and flagging promising cases faster than humans can.
    • It can find “low-hanging fruit”: simple cases that experts might not have time to check.
  • The caveats:
    • A lot of AI answers are wrong, off-target, or not new. Human oversight remains crucial.
    • Formal proof tools don’t fix the biggest issues here: vague problem statements, missing context, and checking for prior work.
    • The risk of uncredited reuse (“subconscious plagiarism”) is real and needs careful handling.
    • Social media hype can exaggerate the importance of routine or already-known results.
  • The path forward:
    • Improve problem databases with clear definitions and references.
    • Build better AI tools to search and attribute prior work.
    • Keep humans responsible for writing and crediting, so accountability and accuracy stay high.
    • Be transparent about both successes and failures to avoid misleading claims.

Bottom line

This paper shows that AI can be a helpful teammate in math research, especially for triaging lots of problems. It can spotlight ideas worth a human’s time. But it doesn’t replace careful human judgment. The biggest wins came not from “superhuman flashes of genius,” but from speeding up search and filtering—while the biggest headaches were checking the literature and making sure the problems were interpreted correctly. With better tools and careful collaboration, the balance could keep improving.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper surfaces several unresolved issues in methodology, evaluation, and broader practice. The following concrete gaps could guide future research:

  • Verification performance and calibration:
    • Lack of precision/recall, false-positive/false-negative rates, and calibration curves for the natural-language verifier on a labeled proof dataset (a minimal sketch of such an evaluation appears after this list).
    • No robustness evaluation against persuasive-but-incorrect proofs or adversarial inputs.
  • Pipeline transparency and reproducibility:
    • Missing operational details of Aletheia (prompt templates, verifier architecture, thresholds, sampling strategy, temperatures, number of retries, stopping criteria, compute budget).
    • No ablation studies comparing variants (with/without verifier, different LLMs/verifiers, retrieval on/off, sampling budgets).
  • Human-in-the-loop evaluation:
    • Unquantified person-hours at each stage, throughput, and triage efficiency; no cost–benefit analysis.
    • No inter-rater agreement metrics, screening guidelines, or bias audits for non-expert vs expert graders.
    • Potential false negatives from early non-expert filtering not assessed.
  • Literature identification and attribution:
    • Absence of a systematic, automated literature-search pipeline (semantic search, citation graph mining, cross-lingual retrieval) with measurable recall/precision.
    • No plagiarism/attribution audit tooling for detecting idea-level overlap between outputs and prior literature (“subconscious plagiarism”) or training-data leakage.
  • Training-data provenance and leakage:
    • No audit indicating whether Bloom’s database or relevant literature was present in model pretraining; lack of contamination assessment and its impact on claimed autonomy/novelty.
  • Definition and measurement of novelty and autonomy:
    • Ambiguous criteria for “autonomous resolution” and “meaningfully correct”; no standardized, auditable rubric for categorizing outputs.
    • No quantitative “novelty score” (e.g., proof similarity metrics to known literature, dependency depth on prior theorems).
  • Problem-statement ambiguity:
    • No standardized, machine-readable ontology for Bloom’s conventions (e.g., additive vs Dirichlet convolution) to prevent misinterpretation.
    • Missing pre-solve validation tools to flag ill-posed, ambiguous, or mistranscribed problems before solution attempts.
  • Benchmarking and datasets:
    • No public benchmark release of the 700 problems annotated with intended statements, literature status, and adjudicated labels (flawed, technically correct, meaningfully correct).
    • The 137 flawed and 50 technically-correct-but-meaningless cases are not shared as a failure-analysis dataset for community improvement.
  • Failure-mode taxonomy:
    • Lack of a detailed taxonomy and counts of error types (logical gaps, definitional confusion, misuse of literature, trivialization via misinterpretation), with targeted mitigation strategies.
  • Formal verification integration:
    • No concrete hybrid pipeline that maps natural-language arguments to formal proof sketches (e.g., Lean) to check local steps or key lemmas; no coverage statistics for what can be formalized.
    • No study of how formalization can help disambiguate intended statements vs literal transcriptions at scale.
  • Generalizability and domain coverage:
    • No per-domain breakdown of outcomes (e.g., combinatorics vs number theory vs geometry) to identify where the approach succeeds or fails.
    • No evaluation on other curated conjecture lists to test transferability beyond Bloom’s Erdős problems.
  • Comparative baselines:
    • Limited, anecdotal comparison to other systems; no controlled head-to-head evaluation across models/verifiers under identical conditions.
  • Provenance and release:
    • Proprietary model and incomplete release of logs/prompts hinder independent replication; no model card with limitations, known failure modes, and training data summary.
  • Literature gaps in solved cases:
    • For “autonomous” and “independent rediscovery” instances, no rigorous, documented literature sweep protocol (including non-English sources and preprints) to strengthen novelty claims.
  • Social and ethical protocols:
    • No formal guidelines for credit/attribution when AI outputs overlap with prior work; no community process for adjudicating disputes.
    • No communication protocol to mitigate social-media misinformation/hype around preliminary AI claims.
  • Metrics of mathematical significance:
    • Absence of standardized measures to evaluate the research value of solved problems (e.g., downstream citations, generalizations enabled, inclusion in active research lines).
  • Prevention of citation inaccuracies:
    • Errors in cited constants/theorems (e.g., in Problem 652) expose a gap in automated citation verification and consistency checking against source texts.
  • Scope of formalization success:
    • Only one solution (Erdős-1051) was formalized; no data on the effort required or success rate in formalizing others, or on the practical barriers encountered.
  • Selection biases:
    • The verifier-driven pruning may introduce selection bias; no analysis of how this shapes which problems are examined or “solved.”
  • Active collaboration strategies:
    • No algorithms or policies for routing candidate solutions to the right domain experts, prioritizing by impact/likelihood, or optimizing human–AI collaboration time.
  • Longitudinal impact:
    • No tracking of whether AI-driven resolutions lead to new theorems, generalizations, or shifts in research agendas over time.
  • Legal/compliance considerations:
    • Unaddressed legal risks around reproduction of training data in outputs and the boundaries of fair use in mathematical text/ideas.
  • Cross-lingual literature coverage:
    • No strategy to systematically include non-English or less-indexed sources, risking missed prior art.
  • Verifier adversarial testing:
    • No stress-testing of the verifier with crafted deceptive proofs or stylistic variants to quantify brittleness.
  • Transparency of human editing:
    • Human rewriting of proofs may alter substance; no released, versioned diffs linking raw outputs to edited text to enable independent auditing.
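
On the first gap listed above (verifier performance and calibration), the sketch below shows the shape such an evaluation could take, assuming access to labeled pairs of verifier confidence and expert verdict; the data, threshold, and binning here are hypothetical:

```python
# Hypothetical evaluation of a natural-language verifier against expert labels.
# Each record: (verifier confidence in [0, 1], did the expert judge the proof correct?).
records = [(0.92, True), (0.81, False), (0.77, True), (0.40, False), (0.15, False)]

threshold = 0.7
tp = sum(1 for p, y in records if p >= threshold and y)
fp = sum(1 for p, y in records if p >= threshold and not y)
fn = sum(1 for p, y in records if p < threshold and y)
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
print(f"precision={precision:.2f} recall={recall:.2f}")

# Crude calibration check: mean confidence vs. empirical accuracy per bin.
bins: dict[int, list[tuple[float, bool]]] = {}
for p, y in records:
    bins.setdefault(min(int(p * 5), 4), []).append((p, y))  # five equal-width bins
for b, items in sorted(bins.items()):
    mean_conf = sum(p for p, _ in items) / len(items)
    accuracy = sum(y for _, y in items) / len(items)
    print(f"bin {b}: mean confidence {mean_conf:.2f}, empirical accuracy {accuracy:.2f}")
```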

Practical Applications

Immediate Applications

Below are actionable applications that can be deployed or piloted today, derived from the paper’s semi-autonomous discovery workflow (LLM generation, natural-language verification, human triage, literature cross-check, and attribution scrutiny).

  • AI-assisted literature reconciliation for “open problem” databases
    • Sectors: academia, publishing, software (knowledge management)
    • What: An agent that flags “open” entries likely already resolved by scanning literature and community wikis, and suggests canonical references.
    • Tools/workflows: arXiv/CrossRef/MathSciNet/zbMATH APIs; citation graph search; semantic retrieval over proofs/lemmas.
    • Assumptions/dependencies: Access to full text; reliable metadata; careful handling of paywalled content; expert adjudication for borderline matches.
  • Natural-language verifier to down-select candidate solutions for expert review
    • Sectors: academia, software (QA), finance (policy/rule compliance), legal (case law search)
    • What: Use LLM-based verifiers to score/cluster model-generated solutions, reducing the expert review burden by focusing on high-confidence candidates.
    • Tools/workflows: verifier prompts calibrated on historical solutions; disagreement-based ensembles; review queues with rationale traces.
    • Assumptions/dependencies: Calibrated confidence; clear rubrics; audit trails; reviewers available to resolve edge cases.
  • Ambiguity and convention checker for problem statements
    • Sectors: academia, education, publishing
    • What: Automatic detection of definitional ambiguities (e.g., additive vs Dirichlet convolution) and missing hypotheses; suggests standard conventions and fixes.
    • Tools/workflows: glossary/ontology of domain conventions; promptable “definition disambiguator”; issue tickets to maintainers.
    • Assumptions/dependencies: Curated domain ontologies; maintainers to accept/merge fixes; community norms for canonical definitions.
  • Provenance and “subconscious plagiarism” auditor for AI-generated proofs
    • Sectors: academia, publishing, policy
    • What: Compare AI outputs against pretraining-era literature using embeddings and citation graphs to surface likely uncredited reproductions.
    • Tools/workflows: similarity search over proof corpora (a minimal code sketch appears after this list); timeline filters (pre/post training cutoff); structured attribution checklists in submissions.
    • Assumptions/dependencies: Access to training data coverage summaries or proxies; legal/policy frameworks for attribution; acceptable false-positive rates.
  • Reviewer triage dashboards for editors and problem curators
    • Sectors: publishing, academia
    • What: Rank submissions/candidates by novelty likelihood, literature overlap, and clarity of statement; route to appropriate experts.
    • Tools/workflows: novelty scores; overlap heatmaps; automated “needs human confirmation” labels; reviewer assignment assistants.
    • Assumptions/dependencies: Integration with submission systems; transparent criteria; expert availability.
  • Lightweight “meaningfully correct” vs “technically correct” evaluation protocols
    • Sectors: academia, education
    • What: Adopt the paper’s taxonomy to label outputs (flawed, technically correct, meaningfully correct), preventing hype and clarifying impact.
    • Tools/workflows: standardized review forms; inter-rater reliability checks; public dashboards for transparency.
    • Assumptions/dependencies: Community buy-in; training for reviewers; consistent definitions across domains.
  • Community wiki and errata integration for problem repositories
    • Sectors: academia, education
    • What: Continuous synchronization between problem databases and community annotations (errata, literature links, clarified statements).
    • Tools/workflows: bots that watch for new references; PR-based updates; provenance logs linking fixes to discussions.
    • Assumptions/dependencies: Moderation; contributor engagement; platform APIs.
  • Courseware and coaching for proof critique and literature search
    • Sectors: education
    • What: Tutor modules that teach students to spot hidden assumptions, mismatched conventions, and prior literature.
    • Tools/workflows: interactive exercises with instant feedback; curated example bank from mis-specified problems.
    • Assumptions/dependencies: Alignment with curricula; safe model behavior; educators to review content.
  • Internal R&D “attention bottleneck” reducers
    • Sectors: software, biotech/healthcare R&D, energy, finance
    • What: Apply the same pipeline to internal “open questions” or bug/issue trackers to surface low-hanging fruit and known fixes.
    • Tools/workflows: semantic de-duplication; literature/patent prior-art linking; risk flags for re-discovery vs novelty.
    • Assumptions/dependencies: Clean internal knowledge bases; IP policies; confidentiality-preserving retrieval.
  • Prior-art search augmentation for patent drafting and review
    • Sectors: legal/IP, industry R&D
    • What: Detect likely prior art and near-duplicates early, reducing filing risk and rework.
    • Tools/workflows: semantic prior-art search; drafting assistants that auto-cite closest art; novelty risk scoring.
    • Assumptions/dependencies: Patent corpus licensing; responsibility and liability frameworks; human attorney oversight.
  • Editorial disclosure and accountability templates for AI use
    • Sectors: academia, publishing, policy
    • What: Standard submission sections for AI contributions, raw logs, and literature checks to ensure accountability.
    • Tools/workflows: AI-use checklists; attached reasoning traces; repository links for raw outputs.
    • Assumptions/dependencies: Journal policies; storage and privacy norms; researcher compliance.
  • Benchmarks for scientific “acceleration” with costs included
    • Sectors: AI research, policy
    • What: Evaluate throughput and error costs (false positives, expert-time drains), using the paper’s transparent accounting as a template.
    • Tools/workflows: time-to-verify metrics; cost-of-correction estimates; standardized reporting of negative cases.
    • Assumptions/dependencies: Community willingness to report negatives; reproducible logs; funding for evaluation.
  • Competition and olympiad training assistants with guardrails
    • Sectors: education, daily life
    • What: Tools that check student solutions for misinterpreted statements, cite related known problems, and scaffold corrections.
    • Tools/workflows: step-by-step critique; “closest known problem” linking (e.g., CMO/IMO TST); style rewrites for clarity.
    • Assumptions/dependencies: Exam integrity policies; opt-in teacher supervision; content filtering.
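
As referenced in the provenance-auditor item above, here is a minimal sketch of embedding-based overlap screening between an AI-generated proof and candidate prior work; `embed` is a trivial stand-in for a real text-embedding model, and the similarity threshold is arbitrary:

```python
import math

def embed(text: str) -> list[float]:
    # Stand-in embedding: a character-frequency vector, just to keep the sketch runnable.
    # A real auditor would call a proper text-embedding model here.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def flag_possible_prior_art(ai_proof: str, prior_works: dict[str, str],
                            threshold: float = 0.95) -> list[str]:
    """Return identifiers of prior works whose text is suspiciously similar."""
    proof_vec = embed(ai_proof)
    return [ref for ref, text in prior_works.items()
            if cosine(proof_vec, embed(text)) >= threshold]
```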

Long-Term Applications

These require advances in reliability, scaling, formalization coverage, or policy standardization.

  • Integrated natural-language + formal verification proof stack
    • Sectors: academia, software verification
    • What: Seamless pipeline translating natural-language drafts into formal proofs (Lean, Isabelle) while preserving intended meaning.
    • Tools/workflows: semantic alignment models; proof-synthesis with definitional convention adapters; intent-checkers.
    • Assumptions/dependencies: Much wider formal library coverage; robust disambiguation; compute and training data.
  • Universal mathematical ontology and machine-readable literature
    • Sectors: academia, publishing, software
    • What: A shared, versioned ontology of definitions/notations linked to machine-readable theorems, enabling precise retrieval and disambiguation.
    • Tools/workflows: publisher mandates for structured math markup; living glossaries; version control for definitions.
    • Assumptions/dependencies: Publisher cooperation; standards bodies; tooling for authoring and conversion.
  • Cross-domain discovery engines for science and engineering
    • Sectors: healthcare, energy, materials, biotech, robotics
    • What: Extend the triage+verification+literature-identification loop to experimental hypotheses, protocols, and simulations.
    • Tools/workflows: structured experiment ontologies; lab notebook ingestion; simulator-in-the-loop verification.
    • Assumptions/dependencies: High-quality datasets; causal modeling; safety/ethics frameworks for suggestions.
  • Novelty and attribution detectors embedded in peer review
    • Sectors: publishing, policy
    • What: Routine novelty screens that consider embeddings over literature, pretraining corpora, and preprints to deter redundant publications.
    • Tools/workflows: rolling-window novelty indexes; temporal filters; author-facing “attribution suggestions.”
    • Assumptions/dependencies: Access to broad corpora; privacy-respecting training data provenance; community norms.
  • Trustworthy AI research agents with auditable reasoning
    • Sectors: academia, policy, software
    • What: Agents that maintain verifiable chains-of-thought, cite sources, and provide uncertainty and risk-of-plagiarism estimates.
    • Tools/workflows: cryptographic logs; structured citations; counterexample searchers; red-team protocols.
    • Assumptions/dependencies: Acceptable exposure of reasoning traces; secure logging; regulatory guidance.
  • Automated problem curation and statement repair at scale
    • Sectors: academia, education
    • What: Periodic sweeps across large repositories to detect ill-posed problems, conflicting conventions, and trivializations; propose fixes.
    • Tools/workflows: anomaly detection in solution difficulty; symbolic counterexample generators; human-in-the-loop approvals.
    • Assumptions/dependencies: Stewards for acceptance; community governance; robust false-positive controls.
  • Funding and policy triage for “low-hanging fruit” programs
    • Sectors: policy, funding agencies
    • What: AI identifies tractable problems whose solution would unblock larger agendas; supports portfolio design.
    • Tools/workflows: tractability scores; dependency graphs; cost-benefit simulations.
    • Assumptions/dependencies: Transparent criteria; avoidance of bias toward “easy” but low-impact questions; expert panels.
  • Sector-specific compliance and standards verification
    • Sectors: healthcare, energy, finance, aerospace
    • What: Map design specs or clinical protocols to standards; detect missing conditions analogous to missing hypotheses in math.
    • Tools/workflows: standards ontologies; requirements traceability; discrepancy explainers.
    • Assumptions/dependencies: Access to standards; liability frameworks; domain tuning.
  • Proof provenance and licensing registries
    • Sectors: academia, publishing, legal
    • What: Registries that timestamp AI/human contributions, track licensing of proof fragments, and enforce attribution.
    • Tools/workflows: DOIs for proof objects; contributor manifests; license checks for reused lemmas.
    • Assumptions/dependencies: Community adoption; metadata interoperability; legal recognition.
  • AI-driven editorial workflows for dynamic literature maintenance
    • Sectors: publishing
    • What: Living review articles auto-updated with new results, retractions, or refinements; editors validate and release updates.
    • Tools/workflows: change-detection pipelines; editor-in-the-loop curation; provenance diffing.
    • Assumptions/dependencies: Sustainable funding; publisher tooling; versioning norms.
  • Multimodal “spec-to-proof” systems for engineering
    • Sectors: robotics, software, hardware design
    • What: Translate natural-language specs and diagrams into verifiable properties and proof obligations, catching spec ambiguities early.
    • Tools/workflows: spec parsers; property synthesis; counterexample-guided refinement.
    • Assumptions/dependencies: Standardized spec formats; formal models of components; integration with CAD/EDA toolchains.
  • Education at scale: personalized proof-writing and literature skills
    • Sectors: education, daily life
    • What: Long-horizon programs that build students’ ability to navigate conventions, cite prior work, and distinguish “meaningful” from “technical” correctness.
    • Tools/workflows: adaptive curricula; rubric-aligned feedback; community practice with attribution norms.
    • Assumptions/dependencies: Teacher training; equitable access; rigorous evaluation of learning gains.
  • Risk and misinformation monitoring for AI-generated science
    • Sectors: policy, media, academia
    • What: Systems that detect and correct overhyped claims (e.g., “AI solved X”) by cross-checking literature and expert validation status.
    • Tools/workflows: claim-tracking knowledge graphs; status badges (unverified/replicated/published); media toolkits.
    • Assumptions/dependencies: Partnerships with platforms; responsible comms guidelines; expert networks.

These applications leverage the paper’s key insights: AI can reduce attention bottlenecks but needs calibrated verifiers, clear conventions, rigorous attribution, and human expertise. Real-world impact hinges on high-quality literature access, strong ontologies, transparent audit trails, and policies that align incentives for correctness and credit.

Glossary

  • Aletheia: A specialized math research agent built on Gemini Deep Think that autonomously explores and verifies mathematical problems. "Aletheia: a specialized math research agent."
  • Autonomous Resolution: A results category indicating an AI-produced solution without human derivation or literature sourcing. "Autonomous Resolution."
  • Bloom's Erdős Problems database: A centralized repository cataloging Erdős’s conjectures and their status. "to systematically evaluate 700 conjectures labeled 'Open' in Bloom's Erdős Problems database."
  • Dirichlet convolution: An arithmetic operation on functions over the integers defined via sums over divisors; used here in contrast with additive convolution (both definitions are written out after this glossary). "e.g., additive versus Dirichlet convolution, strong versus weak completeness, etc."
  • Formal verification: The use of proof assistants and formal systems to mechanically verify the correctness of mathematical proofs. "An alternative approach to the evaluation problem is via formal verification, such as through the Lean language."
  • Fundamental Theorem of Arithmetic: The result that every integer greater than 1 factors uniquely into primes up to ordering. "By the Fundamental Theorem of Arithmetic, a power of 3 equals a power of 2 if and only if both exponents are zero."
  • Incidence: In combinatorial geometry, a point–curve (or point–object) pair where the point lies on the object. "An incidence is a pair $(p,\gamma)$ with $p\in P$ and $\gamma\in C$."
  • Lean 4: The fourth major version of the Lean proof assistant used for formalizing mathematical proofs. "The solution has been formalised in Lean 4 by Barreto,"
  • liminf: The limit inferior; the greatest lower bound of the set of subsequential limits of a sequence. "satisfying $\liminf_{n\to\infty} a_n^{1/2^n} > 1$."
  • limsup: The limit superior; the least upper bound of the set of subsequential limits of a sequence. "If $\ell\geq 2$ then is $\limsup_{n\to \infty}\frac{Q_2(n(n+1)\cdots(n+\ell))}{n^2}$ infinite?"
  • Mahler's criterion: A classical criterion related to rationality/transcendence via properties of power series or related approximations. "moving to the series tail and applying Mahler's criterion."
  • Natural language verifiers: AI components that assess the plausibility or correctness of proofs stated in natural language. "we therefore used AI-based natural language verifiers to narrow the search space"
  • Pach–Sharir theorem: An incidence bound controlling the number of point–curve incidences under certain intersection constraints. "Theorem (Pach–Sharir [PS98, Theorem 1.1])"
  • Power of a Point Theorem: A Euclidean geometry theorem relating products of segment lengths from a point to intersections with a circle. "By the Power of a Point Theorem, the product of the signed lengths of the segments from the intersection point must be equal."
  • Powerful part: For an integer $n$, the product of the prime powers $p^k$ with $k\geq 2$ in its factorization; here denoted $Q_2(n)$. "let $Q_2(n)$ be the powerful part of $n$"
  • Simple curves: Curves in the plane without self-intersections, used as objects in incidence bounds. "Let $C$ be a set of $n$ simple curves in the plane"
  • Subconscious plagiarism: The phenomenon where an AI reproduces content learned during training without explicit attribution. "susceptible to 'subconscious plagiarism' by AI."
  • Telescoping sum: A series where consecutive terms cancel extensively, simplifying to a small number of terms. "gives a telescoping sum"
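
For the convolution ambiguity flagged in the Dirichlet convolution entry (and in the problem-statement discussion above), the two standard definitions that can be confused are, for arithmetic functions $f$ and $g$:

```latex
% Additive convolution: sum over ordered pairs of integers summing to n.
(f \ast_{+} g)(n) = \sum_{a + b = n} f(a)\, g(b)

% Dirichlet convolution: sum over divisor pairs of n.
(f \ast g)(n) = \sum_{d \mid n} f(d)\, g\!\left(\frac{n}{d}\right)
```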

Open Problems

We found no open problems mentioned in this paper.
