ResearchMath-14k: Research-Level Math Dataset
- ResearchMath-14k is a research-grade corpus of math problems drawn from academic literature, reformulated as standalone prompts for advanced model training.
- The dataset employs a multi-agent extraction and refinement process that boosts self-containment of problem statements from 67% to over 94%.
- It features detailed status labels and a companion set of 220K reasoning trajectories to support analysis of long-horizon mathematical reasoning.
ResearchMath-14k is a research-grade corpus of mathematical problems drawn from the research literature and rewritten into self-contained prompts for model training and analysis. It was introduced in "ResearchMath-14K: Scaling Research-Level Mathematics via Agents" as a dataset of problems curated from source documents via a multi-agent pipeline, together with ResearchMath-Reasoning, a companion corpus of teacher trajectories on those prompts. The work positions the corpus as a response to a central bottleneck in frontier-level mathematical reasoning for LLMs: the scarcity of large-scale, training-ready datasets built from genuinely open or research-level problems rather than contest-style mathematics or small evaluation-only benchmarks (Son et al., 27 May 2026).
1. Scope and definition
In this framework, “research-level” refers to questions drawn from the research literature that are posed as open problems, conjectures, or seminar-style directions; require graduate or beyond background; and often depend on paper-local definitions and assumptions that must be inlined to form a standalone prompt (Son et al., 27 May 2026). The intended supervision signal is therefore not limited to final correctness. For many such problems, especially open ones, useful supervision lies in how a solver introduces relevant objects, tests examples, isolates tractable subproblems, and manages uncertainty.
The dataset is explicitly contrasted with two established regimes of public mathematical data. One regime is contest-style training data, including olympiad-level and lower material. The other is research-grade evaluation resources that remain small and often gated to reduce contamination. ResearchMath-14k is presented as addressing both deficits simultaneously by harvesting and rewriting thousands of open questions from the mathematical literature into self-contained prompts, and by pairing them with a large reasoning corpus (Son et al., 27 May 2026).
The resulting corpus contains 14,056 problems after near-duplicate filtering from a seed set of 20,835 candidates; these totals follow from the status and source-stream tables in the sections on coverage and construction. The paper states that this makes it the largest public collection of research-level mathematical problems to date (Son et al., 27 May 2026). A plausible implication is that the dataset is intended not only as a benchmark substrate but also as a training resource for studying long-horizon reasoning under uncertainty.
2. Agentic construction and record schema
ResearchMath-14k is assembled from three source streams spanning papers, web resources, and workshop-style materials. The source mix is meant to capture how open problems are actually circulated in research practice.
| Source stream | Documents | Problems |
|---|---|---|
| arXiv papers explicitly listing open problems | 524 | 8,182 |
| Open-problem web pages discovered via Google | 161 | 5,331 |
| Problem-session sheets and curated lists | 548 | 7,322 |
The curation pipeline has two main agents (Son et al., 27 May 2026). The Extractor agent, described as Codex + GPT-5.5 with high reasoning effort, resolves each source to full text, screens out paywalled or non-problem documents, extracts candidate problems as verbatim quotes plus a first-pass rewrite, and navigates the source to inline the definitions and hypotheses needed for comprehension. Across the 1,233 source documents in the table, it yields a mean of $16.9$ questions per source, with a median of $10$ and a maximum of $358$.
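The per-source mean can be cross-checked against the source-stream table. The short sketch below only re-adds the table's counts; the variable names are illustrative and not taken from the paper.

```python
# Source-stream counts copied from the table above: (documents, problems).
streams = {
    "arXiv papers explicitly listing open problems": (524, 8182),
    "Open-problem web pages discovered via Google": (161, 5331),
    "Problem-session sheets and curated lists": (548, 7322),
}

total_docs = sum(docs for docs, _ in streams.values())
total_problems = sum(problems for _, problems in streams.values())
mean_per_source = total_problems / total_docs

print(total_docs, total_problems, round(mean_per_source, 1))
# prints: 1233 20835 16.9
```

The totals agree with the stated mean of $16.9$ questions per source, which suggests the table's problem counts describe the pre-deduplication seed set.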
The Refiner agent, described as Claude Code Opus 4.7 with medium reasoning effort, re-reads the source to inline missing definitions and notation, checks up to $10$ later citing papers to determine status, and rewrites each candidate into a self-contained research problem. It emits the final JSON record with statement, status, domain metadata, source information, and a solution field when applicable (Son et al., 27 May 2026).
Deduplication is performed by embedding problems with Qwen3-Embedding-8B and marking pairs as duplicates if either the original or refined similarity exceeds 0. Among duplicates, arXiv or paper-source versions are preferred over web-page versions; otherwise one is chosen at random. The paper characterizes this as a conservative threshold intended to separate true duplicates from same-paper near overlaps, and the raw seed set is also released (Son et al., 27 May 2026).
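The deduplication rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the embeddings are already computed, the field names (`orig_emb`, `refined_emb`, `source_type`) are hypothetical, the `threshold` value is left as a parameter, and the greedy single pass is a simplification of whatever clustering the paper actually uses.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_duplicate(rec_a, rec_b, threshold):
    """Duplicate if EITHER the original or the refined similarity exceeds the threshold."""
    return (
        cosine(rec_a["orig_emb"], rec_b["orig_emb"]) > threshold
        or cosine(rec_a["refined_emb"], rec_b["refined_emb"]) > threshold
    )

def deduplicate(records, threshold, seed=0):
    """Greedy pass: keep one record per duplicate pair, preferring paper sources."""
    rng = random.Random(seed)
    kept = []
    for rec in records:
        clash = next(
            (i for i, other in enumerate(kept) if is_duplicate(rec, other, threshold)),
            None,
        )
        if clash is None:
            kept.append(rec)
            continue
        other = kept[clash]
        rec_is_paper = rec["source_type"] == "paper"
        other_is_paper = other["source_type"] == "paper"
        if rec_is_paper != other_is_paper:
            kept[clash] = rec if rec_is_paper else other  # paper beats web page
        else:
            kept[clash] = rng.choice([rec, other])  # otherwise pick at random
    return kept

# Toy records with hand-written 2-d "embeddings" (hypothetical data).
demo = [
    {"id": "web-1", "source_type": "web", "orig_emb": [1.0, 0.0], "refined_emb": [1.0, 0.0]},
    {"id": "paper-1", "source_type": "paper", "orig_emb": [1.0, 0.01], "refined_emb": [0.0, 1.0]},
    {"id": "paper-2", "source_type": "paper", "orig_emb": [0.0, 1.0], "refined_emb": [-1.0, 0.0]},
]
kept_ids = [r["id"] for r in deduplicate(demo, threshold=0.9)]
print(kept_ids)  # prints: ['paper-1', 'paper-2']
```

In the toy data, `web-1` and `paper-1` match on the original embedding only, which is enough to trigger the "either" rule, and the paper-source version is the one retained.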
Each record contains a self-contained statement, a context brief, status metadata, a domain hierarchy with area and macro-subject, free-form research tags, source URL and resolved locator, and evidence quotes with page or block locators. The metadata includes 1 unique tags. Parsing and normalization preserve verbatim evidence quotes and stable locators, explicitly escape LaTeX or TeX sequences in evidence strings, and prefer plain text in rewritten statements while inlining the mathematical content needed for standalone readability (Son et al., 27 May 2026).
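A record of this shape might be modelled as below. The field names are assumptions chosen to mirror the description, not the released schema, and the sample values are placeholders rather than real dataset content.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_STATUS = {"open", "partially solved", "solved", "unknown"}

@dataclass
class ProblemRecord:
    # All field names are hypothetical; consult the released files for the real keys.
    statement: str
    context_brief: str
    status: str
    area: str
    macro_subject: str
    tags: List[str] = field(default_factory=list)
    source_url: str = ""
    source_locator: str = ""
    evidence_quotes: List[dict] = field(default_factory=list)
    solution: Optional[str] = None

    def __post_init__(self):
        if self.status not in ALLOWED_STATUS:
            raise ValueError(f"unexpected status label: {self.status!r}")

def parse_record(line: str) -> ProblemRecord:
    """Parse one JSON object (for example, one line of a JSONL file)."""
    return ProblemRecord(**json.loads(line))

# Placeholder record; json.dumps escapes the backslash in the TeX fragment.
raw = json.dumps({
    "statement": "Placeholder self-contained statement.",
    "context_brief": "Placeholder context.",
    "status": "open",
    "area": "Number Theory",
    "macro_subject": "Arithmetic Geometry",
    "tags": ["placeholder-tag"],
    "source_url": "https://example.org/placeholder",
    "source_locator": "page 1",
    "evidence_quotes": [{"quote": r"Placeholder quote with $\alpha$", "locator": "page 1"}],
})
record = parse_record(raw)
print(record.status, record.evidence_quotes[0]["quote"])
```

Serializing with `json.dumps` rather than hand-writing the JSON string keeps TeX backslashes in evidence quotes correctly escaped, which is the concern the normalization step above addresses.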
3. Coverage, status labels, and difficulty profile
The corpus assigns each problem one of four status labels: open, partially solved, solved, or unknown. The status distribution is as follows.
| Status | Count | Share |
|---|---|---|
| Open | 8,313 | 59.14% |
| Unknown | 2,489 | 17.71% |
| Partially solved | 2,083 | 14.82% |
| Solved | 1,171 | 8.33% |
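The shares in the table follow directly from the counts, and summing the counts gives the size of the labeled corpus. A quick check, using only the numbers in the table:

```python
# Status counts copied from the table above.
status_counts = {
    "Open": 8313,
    "Unknown": 2489,
    "Partially solved": 2083,
    "Solved": 1171,
}

total = sum(status_counts.values())
shares = {label: round(100 * n / total, 2) for label, n in status_counts.items()}

print(total)   # prints: 14056
print(shares)  # prints: {'Open': 59.14, 'Unknown': 17.71, 'Partially solved': 14.82, 'Solved': 8.33}
```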
A self-containment audit on sampled records reports that refined statements were judged self-contained in over 94% of cases, compared with 67% for the original extractions, an improvement of roughly 27 percentage points. The refined statements are also reported to be longer on average (Son et al., 27 May 2026). This suggests that the principal technical labor of the curation process is not merely extraction but contextualization: local notation, assumptions, and definitions are often indispensable for transforming a literature fragment into a training prompt.
Topical coverage is organized into a three-level taxonomy: area, macro-subject, and local research tags. The eleven level-one areas are Analysis/PDEs/Dynamics; Mathematical Physics; Discrete Mathematics/Combinatorics; Number Theory; Geometry/Topology; Theoretical CS; Algebra/Representation; Probability/Statistics/ML; Applied/Computational; Logic/Foundations; and Other/Cross-disciplinary. Four large areas—Analysis/PDEs/Dynamics, Mathematical Physics, Discrete Mathematics/Combinatorics, and Geometry/Topology—account for 9 problems, or 0 of the corpus. Logic/Foundations contributes 1 problems (2), and Other/Cross-disciplinary 3 (4) (Son et al., 27 May 2026).
The corpus spans 5 unique source documents after curation. The top 6 sources contribute 7 problems (8), and the top 9 sources contribute 0 (1) (Son et al., 27 May 2026). This concentration reflects the uneven availability of open-problem sources across subfields.
The paper also provides an Elo-style difficulty comparison against AceMath, AIME, HLE-Verified, and NuminaMath. Pairwise LLM judgments are collected over five datasets, with all sources starting at 2, 3, and wins, losses, and draws derived from 4 comparisons. ResearchMath-14k ranks roughly 5 Elo points above the comparison datasets on Knowledge, Novelty, and Procedural axes, which the paper interprets as evidence of qualitative hardness rather than an incremental increase in difficulty (Son et al., 27 May 2026).
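For readers unfamiliar with Elo-style ranking from pairwise judgments, the standard update is sketched below. This is the textbook Elo rule, not necessarily the exact variant used in the paper; `START_RATING` and `K_FACTOR` are assumed values, and the dataset names and judgments are made up solely to exercise the code.

```python
START_RATING = 1000.0  # assumption: illustrative starting rating
K_FACTOR = 16.0        # assumption: illustrative update step size

def expected_score(r_a, r_b):
    """Probability-like expected score of A against B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))

def apply_comparison(ratings, a, b, score_a, k=K_FACTOR):
    """score_a is 1.0 if A is judged harder, 0.0 if B is, and 0.5 for a draw."""
    e_a = expected_score(ratings[a], ratings[b])
    ratings[a] += k * (score_a - e_a)
    ratings[b] += k * ((1.0 - score_a) - (1.0 - e_a))

# Hypothetical datasets and judgments, not results from the paper.
ratings = {name: START_RATING for name in ["A", "B", "C", "D"]}
judgments = [("A", "B", 1.0), ("C", "D", 0.5)]
for a, b, score_a in judgments:
    apply_comparison(ratings, a, b, score_a)

print(ratings)  # prints: {'A': 1008.0, 'B': 992.0, 'C': 1000.0, 'D': 1000.0}
```

Because each update is zero-sum, the mean rating stays at the starting value, so a gap between one dataset and the rest reflects accumulated wins rather than rating inflation.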
Representative entries illustrate the intended granularity. One example is a self-contained statement of Birch–Swinnerton-Dyer for an elliptic curve 6, labeled open and tagged under Number Theory and Arithmetic Geometry. Another is a genus-7 rational-points problem labeled unknown and tagged under Number Theory and Diophantine Geometry (Son et al., 27 May 2026).
4. ResearchMath-Reasoning and observed model behaviors
ResearchMath-Reasoning is the companion corpus of 220K teacher trajectories generated from GPT-OSS-120B and Qwen3-30B-A3B on the 14K prompts (Son et al., 27 May 2026). Because many of the underlying problems are unsolved, the purpose of the reasoning set is not to provide correct answers in the ordinary benchmark sense. Instead, it is intended to support analysis of solver behavior and to test whether partially useful attempts can still function as supervision.
The paper reports several recurrent avoidance behaviors. In a manual review of 0 sampled trajectories, 1 are classified as non-attempts. More specifically, 2 list related references and output “open” as the answer, while 3 recognize the problem as open and either narrow the conditions or simply list references. Fabricated arXiv and PDF URLs also appear (Son et al., 27 May 2026).
Rule-based counters are then applied to traces using lexical indicators of three behaviors: abandon, assume, and cite. On ResearchMath-14k, row-hit rates across eight models are assume 4 (5), cite 6 (7), and abandon 8 (9). Lemma decomposition, judged by an agent, is rare at 0 (1) (Son et al., 27 May 2026). The combination of frequent assumption language and very low decomposition rates indicates that stylistic markers of mathematical discourse are much more common than explicit subproblem structuring.
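A rule-based counter of this kind can be sketched with regular expressions. The indicator phrases below are hypothetical stand-ins (the paper's lexicons are not reproduced here), and "row-hit rate" is interpreted as the fraction of traces containing at least one match.

```python
import re

# Hypothetical lexical indicators for each behavior.
PATTERNS = {
    "abandon": re.compile(r"\b(?:give up|remains open|beyond the scope)\b", re.IGNORECASE),
    "assume": re.compile(r"\b(?:assume|assuming|suppose)\b", re.IGNORECASE),
    "cite": re.compile(r"arXiv:\s*\d{4}\.\d{4,5}|\bet al\.|\[\d+\]", re.IGNORECASE),
}

def row_hit_rates(traces):
    """Fraction of traces with at least one match, per behavior."""
    hits = {name: 0 for name in PATTERNS}
    for trace in traces:
        for name, pattern in PATTERNS.items():
            if pattern.search(trace):
                hits[name] += 1
    return {name: count / len(traces) for name, count in hits.items()}

# Toy traces written for this example.
toy_traces = [
    "Assume f is smooth. Then by [3] the bound holds.",
    "This problem remains open, so we give up here.",
    "Suppose n is prime; see Smith et al. for background.",
    "We compute the first few cases directly.",
]
rates = row_hit_rates(toy_traces)
print(rates)  # prints: {'abandon': 0.25, 'assume': 0.5, 'cite': 0.5}
```

Lexical counters like this measure surface markers only, which is consistent with the observation above that assumption language can be frequent even when explicit decomposition is rare.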
Reference verification makes the factuality problem more concrete. An Agent-Judge verifies references in 2 traces, 3 per model over eight models. Among these traces, 4 (5) contain at least one reference-like mention, 6 (7) contain at least one fake reference, and across 8 mentions, 9 are fake ($16.9$0) (Son et al., 27 May 2026).
A central empirical finding is that newer open-weight models generate many more references, and many more fabricated references, per trace than older matched counterparts. The paper reports increases on both counts for four matched pairs: DeepSeek R1 $\to$ V4-Pro, Kimi K2 $\to$ K2.6, Qwen3 30B $\to$ Qwen3.5 35B, and Qwen3 235B $\to$ Qwen3.5 397B (Son et al., 27 May 2026). The fake mentions are described as mostly fabricated paper titles and authors, suggesting that citation-heavy post-training without tools at inference encourages stylistic citation that becomes hallucination when retrieval is unavailable.
5. Agentic filtering and fine-tuning outcomes
To mitigate these failure modes, the paper introduces an agentic filtering pipeline for ResearchMath-Reasoning (Son et al., 27 May 2026). Traces are segmented into newline-delimited blocks, a small LLM extracts reference-like spans, and a search-enabled agent verifies each span against the web. If any span is judged fake, the entire trace is removed. Formally, for a trace $t$ with extracted spans $s_1, \dots, s_n$, the trace is retained only if $\mathrm{fake}(s_i) = 0$ for every $i \in \{1, \dots, n\}$, where $\mathrm{fake}(\cdot) \in \{0, 1\}$ is the verifier's judgment.
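The all-or-nothing retention rule can be sketched as follows. This is an illustration under stated assumptions: a regular expression stands in for the small LLM that extracts reference-like spans, and `is_fake` is a caller-supplied stub standing in for the search-enabled verifier. The arXiv identifiers are dummies.

```python
import re

# Stand-in for the LLM span extractor: arXiv-style IDs and bare URLs only.
REF_PATTERN = re.compile(r"arXiv:\d{4}\.\d{4,5}|https?://\S+")

def extract_spans(trace):
    """Segment a trace into newline-delimited blocks and collect reference-like spans."""
    spans = []
    for block in trace.split("\n"):
        spans.extend(REF_PATTERN.findall(block))
    return spans

def keep_trace(trace, is_fake):
    """Retain the trace only if no extracted span is judged fake."""
    return not any(is_fake(span) for span in extract_spans(trace))

def filter_traces(traces, is_fake):
    return [trace for trace in traces if keep_trace(trace, is_fake)]

# Stub verifier: anything outside a known set counts as fake (hypothetical data).
KNOWN_REFERENCES = {"arXiv:0000.00001"}

def stub_is_fake(span):
    return span not in KNOWN_REFERENCES

toy_traces = [
    "Step 1: reduce to the compact case.\nStep 2: apply arXiv:0000.00001.",
    "See arXiv:0000.00001 and also arXiv:0000.00002 for the key lemma.",
    "No references here, just a direct computation.",
]
kept = filter_traces(toy_traces, stub_is_fake)
print(len(kept))  # prints: 2
```

A trace with no reference-like spans is retained vacuously, while a single unverifiable span is enough to drop an otherwise valid trace. The rule therefore trades recall for a stronger guarantee about citation quality.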
This produces a $358$6-trace subset called ResearchMath-Reasoning-Filtered. The paper argues that even without fully correct reasoning, these attempts often remain useful because they introduce relevant objects, partial arguments, or plausible reductions. The filtering is intended to remove the most harmful supervision patterns—non-attempts, unsupported assertions through “assume” language, and fabricated citations—while retaining wrong-but-reasonable trajectories (Son et al., 27 May 2026).
The training study uses LoRA fine-tuning of Qwen3 base models at $358$7B, $358$8B, and $358$9B-A3B parameters on the $10$0 filtered traces, with $10$1 traces randomly sampled from DASD-Thinking as a control. Evaluation is performed on AIME 2024–2026 ($10$2), HLE ($10$3), and SOOHAK Challenge+Mini ($10$4), using integer-only filters where applicable and math-verify scoring. The reported hyperparameters are LoRA rank $10$5, $10$6, dropout $10$7, application to attention and MLP projections, per-device batch size $10$8, global batch size $10$9 for 00B and 01B and 02 for 03B, and sequence length truncation of 04 tokens for 05B and 06B and 07 for 08B. Each configuration is averaged over three seeds (Son et al., 27 May 2026).
The main empirical result is that fine-tuning on ResearchMath-Reasoning-Filtered improves over the base models in all 09 model-by-benchmark cells, with a mean gain of 10 percentage points. Relative to DASD-Thinking, ResearchMath-Reasoning-Filtered wins in 11 cells, with the clearest gains on the research-level evaluations: an average 12 points over DASD across HLE and SOOHAK. The sole exception is AIME for the 13B model, where DASD is 14 points (Son et al., 27 May 2026). The authors interpret this as evidence that open-problem attempts can provide supervision beyond generic mathematical reasoning exposure.
The same section reports that initial attempts to fine-tune Qwen3-4B on unfiltered trajectories caused degeneration, specifically repetitive outputs and frequent non-attempts, with near-zero performance, although specific scores are not reported (Son et al., 27 May 2026). This sharp contrast is presented as the motivating case for the filtering pipeline.
6. Limitations, ethics, availability, and anticipated uses
The paper identifies several limitations. Conservative deduplication can remove distinct but closely related problems, while near-duplicates from the same source may still survive. Some refined statements may still miss subtle source-local definitions; the audit identifies 15 such cases (Son et al., 27 May 2026). Domain skew is another limitation: the four largest areas account for 16 of the corpus, and Other/Cross-disciplinary includes science-adjacent questions that may not align with narrower notions of mathematical training.
Reasoning risks are treated as both a data-quality issue and a misuse concern. Teacher trajectories frequently fabricate references, and newer models are described as more citation-heavy but less factual when tools are unavailable. The paper states that filtering mitigates, but does not eliminate, harmful supervision. It also notes that training on open problems could encourage models to overconfidently assert results or fabricate citations (Son et al., 27 May 2026).
The stated mitigations are paywall screening, verbatim evidence quotes, self-containment checks, fake-reference filtering, and public disclosure of failure modes. Documents behind paywalls are discarded before extraction, and both ResearchMath-14k and ResearchMath-Reasoning are released under the MIT License (Son et al., 27 May 2026).
The dataset is hosted at https://huggingface.co/datasets/amphora/ResearchMath-14k. The released artifacts include the raw seed set, the deduplicated final problem set, the 220K reasoning corpus, and the filtered subset used for training. Records are distributed as JSON objects with statements, status labels, domain taxonomy, source URLs and locators, and evidence quotes. No predefined train, validation, or test splits are supplied; users are expected to define splits appropriate to their training regime (Son et al., 27 May 2026).
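Since no splits are supplied, one option is a deterministic split keyed on the source document, so that problems from the same paper or web page never straddle train and test. The sketch below is one possible design, not a recommendation from the paper; the `source_url` key and the split fractions are assumptions.

```python
import hashlib

def assign_split(source_url, val_frac=0.05, test_frac=0.05):
    """Map a source URL to a split using a stable hash (same URL, same split)."""
    digest = hashlib.sha256(source_url.encode("utf-8")).digest()
    # Interpret the first 8 bytes as a uniform number in [0, 1).
    u = int.from_bytes(digest[:8], "big") / 2**64
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "validation"
    return "train"

def split_records(records):
    """Group records by split; each record is a dict with a 'source_url' key (assumed)."""
    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        splits[assign_split(record["source_url"])].append(record)
    return splits

# Placeholder records sharing one source: they always land in the same split.
toy_records = [
    {"id": 1, "source_url": "https://example.org/source-a"},
    {"id": 2, "source_url": "https://example.org/source-a"},
    {"id": 3, "source_url": "https://example.org/source-b"},
]
toy_splits = split_records(toy_records)
print({name: [r["id"] for r in rows] for name, rows in toy_splits.items()})
```

Grouping by source matters here because a small number of sources contribute a large share of the problems, so a per-record random split would likely place closely related questions on both sides of the boundary.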
Recommended uses are supervised fine-tuning for research-level reasoning, behavior analysis of decomposition and citation quality, and agent development for retrieval and verification. Future directions proposed in the paper include scaling the filtering budget, combining fake-reference removal with non-attempt detection and unsupported-assumption gating, incorporating retrieval tools at inference, designing agentic RL that rewards lemma decomposition and skeptical testing, broadening underrepresented areas, and exploring semi-automated status updates as problems become partially or fully solved (Son et al., 27 May 2026).
7. Terminological ambiguity and disambiguation
The name “ResearchMath-14k” has an unrelated use in algebraic topology and computational topology. In "Random Simple-Homotopy Theory," the term “ResearchMath-14k” is used for a family 19 of triangulated Bing’s houses with 20 rooms, characterized for 21 by 22 vertices and 23, together with a uniform six-expansion certificate for collapse to a point (Benedetti et al., 2021). That usage refers to a simplicial-complex family, not to a machine-learning dataset.
This naming collision is bibliographically relevant because searches for “ResearchMath-14k” can retrieve both the LLM-oriented corpus and the topological construction. In current usage, the dataset introduced in 2026 and the 24 family from simple-homotopy theory are distinct entities that share only the surface form of the label (Benedetti et al., 2021).