Logics-STEM-SFT-Dataset Overview

Updated 4 July 2026

Logics-STEM-SFT-Dataset is a large-scale text-based STEM reasoning corpus featuring 10M full instances and a 2.2M stratified subset with rich metadata.
It employs a five-stage curation pipeline—including annotation, deduplication, decontamination, distillation, and stratified sampling—to ensure high-quality, diverse data.
The dataset is integral to failure-driven post-training, leading to measurable model performance gains on external STEM benchmarks.

Logics-STEM-SFT-Dataset is a large-scale supervised fine-tuning corpus released with the reasoning model Logics-STEM. It consists of question–chain-of-thought–answer triples for STEM reasoning, with a scale of approximately 10 million instances in the full release and 2.2 million instances in a stratified downsample. Each record is text-only, with images excluded, and is annotated with metadata such as domain, educational level, and answer type. The dataset was constructed through a five-stage curation pipeline—annotation, deduplication, decontamination, distillation, and stratified sampling—and is used within a failure-driven post-training framework that combines targeted retrieval and synthetic data generation to improve reasoning performance on external STEM benchmarks (Xu et al., 4 Jan 2026).

1. Dataset identity and scope

Logics-STEM-SFT-Dataset is a training corpus, not a benchmark with internal evaluation splits. The release includes two versions: Logics-STEM-SFT-10M, containing approximately 10 million question–CoT–answer triples, and Logics-STEM-SFT-2.2M, a stratified downsample retaining 2.2 million instances. The stated purpose is post-training for reasoning models in STEM domains, especially for supervised fine-tuning and subsequent second-stage SFT or reinforcement learning (Xu et al., 4 Jan 2026).

The scope is broader than mathematics alone. The metadata explicitly covers Math, Physics, Chemistry, Biology/Medicine, Computer Science, and Engineering, and the educational levels range from elementary to competition. This places the dataset between two established traditions. On one side are multimodal STEM assessment datasets such as the benchmark in "Measuring Vision-Language STEM Skills of Neural Models," which emphasizes K–12 vision-language multiple-choice questions with train/valid/test splits (Shen et al., 2024). On the other side are formal logical reasoning resources, such as automatically generated first-order logic problems in Zermelo–Fraenkel set theory (Ibragimov et al., 20 Feb 2025) and synthetic rule-learning corpora in Datalog form (Cornelio et al., 2019). Logics-STEM-SFT is distinct in being a large-scale textual reasoning corpus with distilled chain-of-thought, rather than a pure benchmark or a narrowly formal-logic generator.

A common misconception is that the dataset is simply a benchmark under the Logics-STEM name. The paper instead defines it as a release for training, with all downstream evaluation done on external STEM benchmarks rather than on an internal held-out split (Xu et al., 4 Jan 2026).

2. Scale, composition, and record structure

The release is organized around two scales and a fixed record schema.

Release	Scale	Notes
Logics-STEM-SFT-10M	~10 million	Full training corpus
Logics-STEM-SFT-2.2M	2.2 million	Stratified downsample
2.2M subset composition	~1.05 M / ~1.14 M	Pure mathematics / broader STEM

Each example contains the following fields: the original question, a chain_of_thought consisting of a multi-step, teacher-distilled reasoning trace ending in a boxed final answer, the extracted answer, and a meta object containing annotations for domain, educational level, and answer type. The release format is compressed JSONL, sharded for parallel I/O, with index files provided for random access (Xu et al., 4 Jan 2026).
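As an illustration of how such a shard might be consumed, the sketch below reads one compressed JSONL file and checks that each record carries the four fields described above. The key names `question`, `chain_of_thought`, `answer`, and `meta`, the use of gzip compression, and the helper name `iter_records` are assumptions made for this example; the release may use different names or a different codec.

```python
import gzip
import json
from typing import Iterator

# Assumed top-level keys, inferred from the field description in the text.
REQUIRED_KEYS = ("question", "chain_of_thought", "answer", "meta")


def iter_records(path: str) -> Iterator[dict]:
    """Yield one record per non-empty line of a gzip-compressed JSONL shard."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(f"record is missing keys: {missing}")
            yield record
```

Because the function is a generator, a shard can be streamed without loading it fully into memory, which suits the sharded layout described above.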

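The paper gives a schema example in which each record is a single JSON object holding the question, chain_of_thought, answer, and meta fields described above.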

The metadata taxonomy is central to the dataset’s reuse. Domains distinguish mathematics from broader STEM; educational level ranges over elementary, junior-secondary, senior-secondary, undergraduate, graduate, competition; and answer types include Boolean, multiple-choice, numeric, vector/matrix, interval, expression, string, proof, explanation, other. The full corpus therefore supports stratification by both disciplinary content and response modality (Xu et al., 4 Jan 2026).

Several distributional statistics are reported. After deduplication and decontamination, the corpus size is approximately 10 M. In the 2.2M subset there are approximately 1.05 M pure mathematics samples and 1.14 M broader-STEM samples. Chain-of-thought lengths range from 10 to more than 200 tokens, with a median of ~50, and the average prompt-plus-CoT context is ~600 tokens. The paper further states that lexical diversity (type–token ratio) is high, reflecting multi-disciplinary vocabulary (Xu et al., 4 Jan 2026).

3. Five-stage curation pipeline

The dataset is defined operationally by a five-stage curation engine.

Stage 1: Annotation. Raw questions are processed by Qwen3-235B-Instruct, which annotates each one for validity (including unambiguity), discipline, educational level, answer type, and whether the answer is verifiable. Questions that are unsolvable, incomplete, ambiguous, or dependent on a missing image are discarded. The discipline labels fall into six buckets: Math; STEM w/o Math; Humanities/Social Science; Business; Medicine/Biology; Code (Xu et al., 4 Jan 2026).

Stage 2: Deduplication. Exact duplicates are removed using MD5(question) fingerprints. Near duplicates are removed using MinHash with 24 bands of width 10, which approximates Jaccard similarity. Within each MinHash bucket, only one example is kept, with priority given to samples with verifiable answers. This combination is designed to reduce both literal repetition and near-paraphrastic overlap (Xu et al., 4 Jan 2026).
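The two deduplication passes can be sketched as follows. This is a minimal illustration, not the paper's implementation: it reads "24 bands of width 10" as 24 bands of 10 MinHash values each (240 hash functions), and the word-trigram shingling, the linear hash family, the `verifiable` field name, and the greedy keep-first strategy are all assumptions made for the example.

```python
import hashlib
import random

NUM_BANDS, BAND_WIDTH = 24, 10           # from the text: 24 bands of width 10
NUM_HASHES = NUM_BANDS * BAND_WIDTH      # assumed: 240 hash functions in total
PRIME = (1 << 61) - 1
_rng = random.Random(0)                  # fixed seed so signatures are reproducible
COEFFS = [(_rng.randrange(1, PRIME), _rng.randrange(PRIME)) for _ in range(NUM_HASHES)]


def md5_key(question: str) -> str:
    """Fingerprint used for exact-duplicate removal."""
    return hashlib.md5(question.encode("utf-8")).hexdigest()


def shingles(text: str, n: int = 3) -> set:
    """Word n-gram shingles (assumed choice; the paper does not specify one here)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def minhash_signature(text: str) -> list:
    """One minimum per hash function over the hashed shingle set."""
    base = [
        int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingles(text)
    ]
    return [min((a * h + b) % PRIME for h in base) for a, b in COEFFS]


def band_keys(signature: list) -> list:
    """Split the signature into (band index, band values) bucket keys."""
    return [
        (i, tuple(signature[i * BAND_WIDTH:(i + 1) * BAND_WIDTH]))
        for i in range(NUM_BANDS)
    ]


def deduplicate(records: list) -> list:
    """Keep one record per MD5 fingerprint and per MinHash bucket."""
    # Visit records with verifiable answers first so they take priority.
    ordered = sorted(records, key=lambda r: not r.get("verifiable", False))
    seen_md5, seen_bands, kept = set(), set(), []
    for rec in ordered:
        key = md5_key(rec["question"])
        bands = band_keys(minhash_signature(rec["question"]))
        if key in seen_md5 or any(b in seen_bands for b in bands):
            continue  # exact duplicate, or shares a bucket with a kept record
        seen_md5.add(key)
        seen_bands.update(bands)
        kept.append(rec)
    return kept
```

With bands of 10 values, two questions collide in a given band only if all 10 MinHash values agree, so the scheme targets pairs with high Jaccard similarity while leaving loosely related questions alone.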

Stage 3: Decontamination. Training samples are removed if they either share a MinHash bucket with any held-out evaluation example or contain an identical 13-gram with such an example. The paper describes this as a dual check that drives overlap to near zero (Xu et al., 4 Jan 2026).
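The 13-gram half of this dual check could look like the sketch below. Whitespace-separated lowercase words stand in for tokens, which is an assumption; the paper may count 13-grams over a different tokenization. The MinHash-bucket half is omitted here, and the function names are hypothetical.

```python
def word_ngrams(text: str, n: int = 13) -> set:
    """All contiguous n-word windows of the lowercased text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_eval_index(eval_questions: list, n: int = 13) -> set:
    """Union of the n-grams of every held-out evaluation question."""
    index = set()
    for question in eval_questions:
        index |= word_ngrams(question, n)
    return index


def is_contaminated(train_question: str, eval_index: set, n: int = 13) -> bool:
    """True if the training question shares at least one n-gram with the index."""
    return not word_ngrams(train_question, n).isdisjoint(eval_index)
```

Questions shorter than 13 words produce no 13-grams and therefore can only be caught by the MinHash check, which is one reason a dual check is useful.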

Stage 4: Response Distillation. Chain-of-thought traces are generated by Qwen3-235B-A22B-Thinking-2507 using the configuration temperature = 0.6, top_k = 20, top_p = 0.95, max_context_length = 32768. CoTs whose n-gram duplication ratio exceeds a preset threshold are discarded. The paper notes that optional answer-verification via math-verify is available, but also states that erroneous CoTs are retained in small quantities since aggressive filtering was shown to degrade model robustness (Xu et al., 4 Jan 2026).
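A repetition filter of the kind described might be sketched as below. The sampling configuration repeats the values quoted above; the definition of the duplication ratio (one minus the share of unique n-grams), the n-gram size of 4, and the threshold of 0.3 are assumptions for illustration, since the text only mentions "a preset threshold".

```python
from collections import Counter

# Decoding settings quoted in the text for the teacher model.
DISTILL_CONFIG = {
    "temperature": 0.6,
    "top_k": 20,
    "top_p": 0.95,
    "max_context_length": 32768,
}


def ngram_duplication_ratio(text: str, n: int = 4) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram (assumed definition)."""
    tokens = text.split()
    total = len(tokens) - n + 1
    if total <= 0:
        return 0.0  # too short to contain any n-gram
    counts = Counter(tuple(tokens[i:i + n]) for i in range(total))
    return 1.0 - len(counts) / total


def keep_cot(text: str, threshold: float = 0.3, n: int = 4) -> bool:
    """Discard a trace only when its duplication ratio exceeds the threshold."""
    return ngram_duplication_ratio(text, n) <= threshold
```

A trace that loops on the same phrase scores close to 1 and is dropped, while a trace with no repeated 4-grams scores 0 and is kept regardless of whether its final answer is correct.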

Stage 5: Stratified Sampling. The proxy for difficulty is the token length $l$ of the distilled CoT. Let $Q_p$ denote the $p$-quantile of the lengths $\{l_i\}$, so that $Q_{0.75}$ is the 75th percentile. The sampling weight is defined as

$w(l)= \begin{cases} 1.0, & l\ge Q_{0.75},\\ 0.5, & Q_{0.50}\le l<Q_{0.75},\\ 0.1, & Q_{0.20}\le l<Q_{0.50},\\ 0.0, & l<Q_{0.20}. \end{cases}$

The 2.2M release is then sampled with probability $P_i \propto w(l_i)$ (Xu et al., 4 Jan 2026).
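The weighting rule can be turned into a small sampler, sketched here under stated assumptions: quantiles are computed by linear interpolation, and sampling without replacement uses Efraimidis-Spirakis random keys as one reasonable way to realize $P_i \propto w(l_i)$. The paper does not say which quantile convention or sampling procedure was used, and the function names are hypothetical.

```python
import random


def quantile(sorted_vals: list, p: float) -> float:
    """Linearly interpolated p-quantile of an ascending list, p in [0, 1]."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def make_weight_fn(lengths: list):
    """Build w(l) from the 20th, 50th, and 75th percentiles of the CoT lengths."""
    ordered = sorted(lengths)
    q20, q50, q75 = (quantile(ordered, p) for p in (0.20, 0.50, 0.75))

    def weight(length: float) -> float:
        if length >= q75:
            return 1.0
        if length >= q50:
            return 0.5
        if length >= q20:
            return 0.1
        return 0.0

    return weight


def stratified_sample(lengths: list, k: int, seed: int = 0) -> list:
    """Draw up to k distinct indices, favoring longer traces via w(l)."""
    weight = make_weight_fn(lengths)
    rng = random.Random(seed)
    keyed = []
    for index, length in enumerate(lengths):
        w = weight(length)
        if w > 0:  # zero-weight items can never be selected
            # Efraimidis-Spirakis key: the k largest u**(1/w) form the sample.
            keyed.append((rng.random() ** (1.0 / w), index))
    keyed.sort(reverse=True)
    return [index for _, index in keyed[:k]]
```

The effect is that the shortest fifth of traces is excluded entirely and the longest quarter is ten times as likely to be drawn as traces between the 20th and 50th percentiles.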

This pipeline indicates that “quality” is operationalized not only as correctness and cleanliness, but also as diversity, benchmark isolation, and controlled retention of harder reasoning traces.

4. Quality controls, release conventions, and what the dataset is not

The release provides a training corpus without internal val/test splits. The paper is explicit that all downstream evaluation is done on external STEM benchmarks, including AIME2024, AIME2025, HMMT2025, BeyondAIME, GPQA-Diamond, MMLU-Pro-STEM, CMMLU-STEM, and R-Bench, among others (Xu et al., 4 Jan 2026).

Quality measurements are tied to the pipeline stages. Annotation quality is enforced by the Instruct teacher model’s built-in heuristics. Deduplication and decontamination coverage are measured by bucket-drop rates (≈67% dropped to reach 10 M). Distillation quality is monitored using n-gram repetition ratios and math-verify answer-match rates (Xu et al., 4 Jan 2026).

Several misunderstandings are corrected by these release conventions. First, the dataset is not a multimodal resource: the paper states that the question field contains the original STEM problem with images excluded. This contrasts with the earlier STEM benchmark, where every retained example has at least one image and the schema includes context_image plus possibly image-based answer options (Shen et al., 2024). Second, the dataset is not described as fully verified ground truth CoT. The paper explicitly retains a small quantity of erroneous CoTs because aggressive filtering harmed robustness (Xu et al., 4 Jan 2026).

A plausible implication is that the dataset is optimized for post-training robustness rather than for creating a clean symbolic proof corpus in the formal-methods sense.

5. Role in failure-driven post-training

Logics-STEM-SFT-Dataset is embedded in a broader data-algorithm co-design framework. The abstract attributes the model gains to joint optimization of data and algorithm to fit a “gold-standard distribution behind reasoning,” and the detailed description formalizes this with a first-stage SFT objective, a failure indicator, a failure-biased prompt distribution, a retrieval kernel, a synthesized training distribution, and a final mixture distribution (Xu et al., 4 Jan 2026).

The first-stage SFT objective is

$\theta_1 = \arg\min_{\theta} \mathbb{E}_{(x,y)\sim P_0}\Biggl[-\sum_{t=1}^T\log\pi_\theta\!\bigl(y^t\mid x,y^{<t}\bigr)\Biggr].$
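To make the objective concrete, the following sketch evaluates the bracketed term for a single example and averages it over a batch, which is the usual Monte Carlo estimate of the expectation over $P_0$. The inputs are the probabilities a model assigns to each target token; the function names are hypothetical and no particular training framework is implied.

```python
import math


def sequence_nll(token_probs: list) -> float:
    """-sum_t log pi(y^t | x, y^{<t}) for one (x, y) pair."""
    return -sum(math.log(p) for p in token_probs)


def sft_loss(batch: list) -> float:
    """Average sequence NLL over a batch of examples drawn from P_0."""
    return sum(sequence_nll(probs) for probs in batch) / len(batch)
```

For instance, a two-token answer whose tokens each receive probability 0.5 contributes $2\ln 2$ to the loss, while a token predicted with probability 1 contributes nothing.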

The binary failure indicator is

$w_{\theta_1}(x) = \mathbf{1}\{\text{model’s final answer on }x\text{ is incorrect}\}.$

This induces a failure-biased prompt distribution

$Q_{\theta_1}(x) = \frac{Q(x)\,w_{\theta_1}(x)} {\mathbb{E}_{x\sim Q}[\,w_{\theta_1}(x)\,]}.$
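For a finite prompt pool, this reweighting amounts to zeroing out prompts the model solves and renormalizing the rest. The sketch below assumes $Q$ is given as a dictionary of probabilities and the failure indicator as a dictionary of booleans; both representations and the function name are illustrative choices.

```python
def failure_biased_distribution(q: dict, failed: dict) -> dict:
    """Return Q_theta1(x) = Q(x) * w(x) / E_{x~Q}[w(x)] over a finite prompt set.

    q maps each prompt to Q(x); failed maps each prompt to the indicator w(x).
    """
    # E_{x~Q}[w(x)] is the total probability mass on failed prompts.
    failure_rate = sum(p for x, p in q.items() if failed[x])
    if failure_rate == 0:
        raise ValueError("no failures under Q, so the distribution is undefined")
    return {x: (p / failure_rate if failed[x] else 0.0) for x, p in q.items()}
```

The normalizer is simply the model's failure rate under $Q$, so a model that fails on half of the probability mass doubles the weight of each failed prompt.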

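Document retrieval is modeled with a top-$k$ kernel that associates each failure prompt $x \sim Q_{\theta_1}$ with its $k$ highest-ranked documents. Data synthesized around the retrieved documents defines a synthesized training distribution, and the second-stage objective is then optimized over a final mixture distribution (Xu et al., 4 Jan 2026).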

Within this formalism, Logics-STEM-SFT supplies the initial supervised distribution $P_0$ and, by construction, the metadata needed for targeted sampling and failure analysis. The dataset is therefore not merely a corpus of solved problems; it is part of a feedback loop in which failures trigger retrieval and synthesis around failure regions (Xu et al., 4 Jan 2026).
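One way the retrieval, synthesis, and mixture steps could be written down is sketched below. It is only an illustration consistent with the first-stage objective and the failure-biased distribution $Q_{\theta_1}$: the document store $\mathcal{D}$, the synthesis model $G$, the mixture weight $\lambda$, and the names $K_k$, $P_{\mathrm{syn}}$, and $P_1$ are assumptions introduced here, and the paper's own notation and definitions may differ.

```latex
\begin{align*}
% Top-k retrieval kernel: uniform over the k documents ranked highest for x
K_k(d \mid x) &= \frac{1}{k}\,\mathbf{1}\{\, d \in \operatorname{Top}_k(x;\mathcal{D}) \,\}, \\
% Synthesized distribution: draw a failure prompt, retrieve, then generate
P_{\mathrm{syn}}(x', y') &= \mathbb{E}_{x \sim Q_{\theta_1}}\,
  \mathbb{E}_{d \sim K_k(\cdot \mid x)}\bigl[\, G(x', y' \mid x, d) \,\bigr], \\
% Mixture of the original and synthesized data
P_1 &= (1-\lambda)\,P_0 + \lambda\,P_{\mathrm{syn}}, \qquad \lambda \in [0,1], \\
% Second-stage objective: same token-level loss, new data distribution
\theta_2 &= \arg\min_{\theta}\, \mathbb{E}_{(x,y)\sim P_1}
  \Bigl[-\sum_{t=1}^{T} \log \pi_\theta\bigl(y^t \mid x, y^{<t}\bigr)\Bigr].
\end{align*}
```

Under this reading, setting $\lambda = 0$ recovers the first-stage objective, and increasing $\lambda$ shifts training mass toward regions where the first-stage model fails.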

6. Empirical effects and relation to adjacent dataset traditions

The paper reports that fine-tuning Qwen3-8B on Logics-STEM-SFT-2.2M yields gains on multiple external reasoning benchmarks: AIME2024 Pass@1 rises from 76.0% to 80.62%, AIME2025 from 67.3% to 73.33%, GPQA-Diamond from 62.0% to 72.70%, and MMLU-Pro-STEM from 83.02% to 85.20% (Xu et al., 4 Jan 2026). These results locate the dataset within contemporary post-training practice: it is intended to change model behavior on out-of-distribution evaluations rather than to serve as a closed-form testbed.

In the wider research landscape, Logics-STEM-SFT occupies a different niche from at least three adjacent resources. The multimodal STEM benchmark of 1,073,146 questions spans 448 skills and emphasizes vision-language K–12 evaluation with fixed train/valid/test partitions (Shen et al., 2024). The first-order logic datasets generated in Zermelo–Fraenkel set theory emphasize truth/falsity judgments on prenex formulas whose difficulty is controlled by graph parameters and negation structure (Ibragimov et al., 20 Feb 2025). RuDaS, by contrast, generates synthetic Datalog rule-learning corpora and evaluates learned rules using Herbrand-based measures, IR measures, and R-score (Cornelio et al., 2019).

This comparison clarifies what “Logics” means in Logics-STEM-SFT. The paper does not define the dataset as a purely formal-logic corpus. Instead, it combines large-scale open-source data with carefully designed synthetic data and distills long reasoning traces across mathematics and broader STEM (Xu et al., 4 Jan 2026). A plausible implication is that the dataset’s “logical” character lies in post-training for multi-step reasoning, not in restricting the corpus to symbolic logic instances.

The defining properties of Logics-STEM-SFT-Dataset are therefore its 10M-scale text-only reasoning triples, its five-stage curation pipeline, its metadata-rich stratified downsampling, and its explicit integration into failure-driven post-training for STEM reasoning models (Xu et al., 4 Jan 2026).